Files
team-learning-data-mining/AnomalyDetection/附录.md
2021-05-08 17:28:54 +08:00

5.0 KiB
Raw Blame History

附录

数据集

  1. Outlier Detection DataSets (ODDS) http://odds.cs.stonybrook.edu/#table1
  2. KDD99网络入侵http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

工具

scikit-learn

ensemble.IsolationForestneighbors.LocalOutlierFactor 在此处考虑的数据集上表现良好。的svm.OneClassSVM被称为是对异常值敏感并因此对异常值检测不执行得非常好。话虽如此,在高维中进行离群检测,或者不对基础数据的分布进行任何假设都是非常具有挑战性的。svm.OneClassSVM仍然可以与离群值检测一起使用,但需要对其超参数nu进行微调 以处理离群值并防止过拟合。最后, covariance.EllipticEnvelope假设数据为高斯并学习一个椭圆。有关不同估计量的更多详细信息,请参阅“比较异常检测算法以对玩具数据集进行异常检测的示例” 及其以下部分。

../_images/sphx_glr_plot_anomaly_comparison_0011.png

PyOD

PyOD是一个全面且可扩展的Python工具箱用于检测多元数据中的异常对象。这个令人兴奋而又充满挑战的领域通常称为 异常值检测异常检测

从传统的LOFSIGMOD 2000到最新的COPODICDM 2020PyOD包括30多种检测算法。自2017年以来PyOD [ AZNL19 ]已成功用于众多学术研究和商业产品[ AGSW19ALCJ + 19AWDL + 19AZNHL19 ]。机器学习社区也通过各种专门的帖子/教程对它进行了广泛认可,包括 Analytics Vidhya Towards Data Science KDnuggets Computer Vision Newsawesome-machine-learning.。

代码示例

以下内容来自PyOD官网介绍

  1. 导入模型

    from pyod.models.knn import KNN   # kNN detector
    
  2. 使用 pyod.utils.data.generate_data()生成数据

    contamination = 0.1  # percentage of outliers
    n_train = 200  # number of training points
    n_test = 100  # number of testing points
    
    X_train, y_train, X_test, y_test = generate_data(
        n_train=n_train, n_test=n_test, contamination=contamination)
    
  3. fit and predict

# train kNN detector
clf_name = 'KNN'
clf = KNN()
clf.fit(X_train)

# get the prediction labels and outlier scores of the training data
y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_  # raw outlier scores

# get the prediction on the test data
y_test_pred = clf.predict(X_test)  # outlier labels (0 or 1)
y_test_scores = clf.decision_function(X_test)  # outlier scores
  1. 使用ROC和Precision @ Rank n 评估预测

    from pyod.utils.data import evaluate_print
    # evaluate and print the results
    print("\nOn Training Data:")
    evaluate_print(clf_name, y_train, y_train_scores)
    print("\nOn Test Data:")
    evaluate_print(clf_name, y_test, y_test_scores)
    
  2. 观察训练集以及测试集的输出

    On Training Data:
    KNN ROC:1.0, precision @ rank n:1.0
    
    On Test Data:
    KNN ROC:0.9989, precision @ rank n:0.9
    
  3. 可视化

    visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
              y_test_pred, show_figure=True, save_figure=False)
    

kNN demo

详细代码请看官网说明:https://pyod.readthedocs.io/en/latest/pyod.html