Update 12-tdinsight.md
This commit is contained in:
parent
a1d196688e
commit
217c4c5506
|
@ -171,7 +171,37 @@ Metric details:
|
|||
5. **Writes**: Total number of writes
|
||||
6. **Other**: Total number of other requests
|
||||
|
||||
There are also line charts for the above categories.
|
||||
There are also line charts for the above categories.
|
||||
|
||||
### Automatic import of preconfigured alert rules
|
||||
|
||||
After summarizing user experience, 14 commonly used alert rules are sorted out. These alert rules can monitor key indicators of the TDengine cluster and report alerts, such as abnormal and exceeded indicators.
|
||||
Starting from TDengine-Server 3.3.4.3 (TDengine-datasource 3.6.3), TDengine Datasource supports automatic import of preconfigured alert rules. You can import 14 alert rules to Grafana (version 11 or later) with one click.
|
||||
In the TDengine-datasource setting interface, turn on the "Load Tengine Alert" switch, click the "Save & test" button, the plugin will automatically load the mentioned 14 alert rules. The rules will be placed in the Grafana alerts directory. If not required, turn off the "Load TDengine Alert" switch, and click the button next to "Clear TDengine Alert" to clear all the alert rules imported into this data source.
|
||||
|
||||
After importing, click on "Alert rules" on the left side of the Grafana interface to view all current alert rules. By configuring contact points, users can receive alert notifications.
|
||||
|
||||
The specific configuration of the 14 alert rules is as follows:
|
||||
|
||||
| alert rule| Rule threshold| Behavior when no data | Data scanning interval |Duration | SQL |
|
||||
| ------ | --------- | ---------------- | ----------- |------- |----------------------|
|
||||
|CPU load of dnode node|average > 80%|Trigger alert|5 minutes|5 minutes |`select now(), dnode_id, last(cpu_system) as cup_use from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts < now partition by dnode_id having first(_ts) > 0 `|
|
||||
|Memory of dnode node |average > 60%|Trigger alert|5 minutes|5 minutes|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
|
||||
|Disk capacity occupancy of dnode nodes | > 80%|Trigger alert|5 minutes|5 minutes|`select now(), dnode_id, data_dir_level, data_dir_name, last(used) / last(total) * 100 as used from log.taosd_dnodes_data_dirs where _ts >= (now - 5m) and _ts < now partition by dnode_id, data_dir_level, data_dir_name`|
|
||||
|Authorization expires |< 60天|Trigger alert|1 day|0 0 seconds|`select now(), cluster_id, last(grants_expire_time) / 86400 as expire_time from log.taosd_cluster_info where _ts >= (now - 24h) and _ts < now partition by cluster_id having first(_ts) > 0 `|
|
||||
|The used measurement points has reached the authorized number|>= 90%|Trigger alert|1 day|0 seconds|`select now(), cluster_id, CASE WHEN max(grants_timeseries_total) > 0.0 THEN max(grants_timeseries_used) /max(grants_timeseries_total) * 100.0 ELSE 0.0 END AS result from log.taosd_cluster_info where _ts >= (now - 30s) and _ts < now partition by cluster_id having timetruncate(first(_ts), 1m) > 0`|
|
||||
|Number of concurrent query requests | > 100|Do not trigger alert|1 minute|0 seconds|`select now() as ts, count(*) as slow_count from performance_schema.perf_queries`|
|
||||
|Maximum time for slow query execution (no time window) |> 300秒|Do not trigger alert|1 minute|0 seconds|`select now() as ts, count(*) as slow_count from performance_schema.perf_queries where exec_usec>300000000`|
|
||||
|dnode offline |total != alive|Trigger alert|30 seconds|0 seconds|`select now(), cluster_id, last(dnodes_total) - last(dnodes_alive) as dnode_offline from log.taosd_cluster_info where _ts >= (now -30s) and _ts < now partition by cluster_id having first(_ts) > 0`|
|
||||
|vnode offline |total != alive|Trigger alert|30 seconds|0 seconds|`select now(), cluster_id, last(vnodes_total) - last(vnodes_alive) as vnode_offline from log.taosd_cluster_info where _ts >= (now - 30s) and _ts < now partition by cluster_id having first(_ts) > 0 `|
|
||||
|Number of data deletion requests |> 0|Do not trigger alert|30 seconds|0 seconds|``select now(), count(`count`) as `delete_count` from log.taos_sql_req where sql_type = 'delete' and _ts >= (now -30s) and _ts < now``|
|
||||
|Adapter RESTful request fail |> 5|Do not trigger alert|30 seconds|0 seconds|``select now(), sum(`fail`) as `Failed` from log.adapter_requests where req_type=0 and ts >= (now -30s) and ts < now``|
|
||||
|Adapter WebSocket request fail |> 5|Do not trigger alert|30 seconds|0 seconds|``select now(), sum(`fail`) as `Failed` from log.adapter_requests where req_type=1 and ts >= (now -30s) and ts < now``|
|
||||
|Dnode data reporting is missing |< 3|Trigger alert|180 seconds|0 seconds|`select now(), cluster_id, count(*) as dnode_report from log.taosd_cluster_info where _ts >= (now -180s) and _ts < now partition by cluster_id having timetruncate(first(_ts), 1h) > 0`|
|
||||
|Restart dnode |max(update_time) > last(update_time)|Trigger alert|90 seconds|0 seconds|`select now(), dnode_id, max(uptime) - last(uptime) as dnode_restart from log.taosd_dnodes_info where _ts >= (now - 90s) and _ts < now partition by dnode_id`|
|
||||
|
||||
TDengine users can modify and improve these alert rules according to their own business needs. In Grafana 7.5 and below versions, the Dashboard and Alert rules functions are combined, while in subsequent new versions, the two functions are separated. To be compatible with Grafana7.5 and below versions, an Alert Used Only panel has been added to the TDinsight panel, which is only required for Grafana7.5 and below versions.
|
||||
|
||||
|
||||
## Upgrade
|
||||
|
||||
|
|
Loading…
Reference in New Issue