homework-jianmu/docs/en/06-advanced/06-tdgpt/05-anomaly-detection/02-statistics-approach.md

3.1 KiB
Raw Blame History

title sidebar_label
Statistical Algorithms Statistical Algorithms
  • k-sigma[1], or 689599.7 rule: The k value defines how many standard deviations indicate an anomaly. The default value is 3. The k-sigma algorithm require data to be in a regular distribution. Data points that lie outside of k standard deviations are considered anomalous.
Parameter Description Required? Default
k Number of standard deviations No 3
--- Use the k-sigma algorithm with a k value of 2
SELECT _WSTART, COUNT(*)
FROM foo
ANOMALY_WINDOW(foo.i32, "algo=ksigma,k=2")
  • Interquartile range (IQR)[2]: IQR divides a rank-ordered dataset into even quartiles, Q1 through Q3. IQR=Q3=Q1, for v, Q1 - (1.5 x IQR) <= v <= Q3 + (1.5 x IQR) is normal. Data points outside this range are considered anomalous. This algorithm does not take any parameters.
--- Use the IQR algorithm.
SELECT _WSTART, COUNT(*)
FROM foo
ANOMALY_WINDOW(foo.i32, "algo=iqr")
  • Grubbs's test[3], or maximum normalized residual test: Grubbs is used to test whether the deviation from mean of the maximum and minimum is anomalous. It requires a univariate data set in a close to normal distribution. Grubbs's test cannot be uses for datasets that are not normally distributed. This algorithm does not take any parameters.
--- Use Grubbs's test.
SELECT _WSTART, COUNT(*)
FROM foo
ANOMALY_WINDOW(foo.i32, "algo=grubbs")
  • Seasonal Hybrid ESD (S-H-ESD)[4]: Extreme Studentized Deviate (ESD) can identify multiple anomalies in time-series data. You define whether to detect positive anomalies (pos), negative anomalies (neg), or both (both). The maximum proportion of data that can be anomalous (max_anoms) is at worst 49.9% Typically, the proportion of anomalies in a dataset does not exceed 5%.
Parameter Description Required? Default
direction Specify the direction of anomalies ('pos', 'neg', or 'both'). No "both"
max_anoms Specify maximum proportion of data that can be anomalous k, where 0 < k <= 49.9 No 0.05
period The number of data points included in each period No 0
--- Use the SHESD algorithm in both direction with a maximum 5% of the data being anomalous
SELECT _WSTART, COUNT(*)
FROM foo
ANOMALY_WINDOW(foo.i32, "algo=shesd,direction=both,anoms=0.05")

The following algorithms are in development:

  • Gaussian Process Regression

Change point detection--based algorithms:

  • CUSUM (Cumulative Sum Control Chart)
  • PELT (Pruned Exact Linear Time)

References

  1. https://en.wikipedia.org/wiki/689599.7 rule
  2. https://en.wikipedia.org/wiki/Interquartile_range
  3. Adikaram, K. K. L. B.; Hussein, M. A.; Effenberger, M.; Becker, T. (2015-01-14). "Data Transformation Technique to Improve the Outlier Detection Power of Grubbs's Test for Data Expected to Follow Linear Relation". Journal of Applied Mathematics. 2015: 19. doi:10.1155/2015/708948.
  4. Hochenbaum, O. S. Vallis, and A. Kejariwal. 2017. Automatic Anomaly Detection in the Cloud Via Statistical Learning. arXiv preprint arXiv:1704.07706 (2017).