
[Purpose]

  • Hands-on practice with Robust Random Cut Forest (RRCF) code
  • Used when the data is multivariate
  • A score is computed for each data point, which can then be used to flag abnormal points

[Process]

  1. Define Data
  2. Modeling
  3. Plotting
!pip install pyod
!pip install rrcf
import rrcf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

from pyod.utils.data import generate_data, get_outliers_inliers
# Data Loading
X, Y = generate_data(behaviour='new', n_features=10, 
                     train_only=True,
                     contamination=0.1,
                     random_state=2023)
# Name the columns X1, X2, ..., one per feature
col_list = ['X{}'.format(i + 1) for i in range(X.shape[1])]
# Build the DataFrame and attach the ground-truth labels
df = pd.DataFrame(X, columns=col_list)
df['Y'] = Y
# Check the data distribution for X1 and X2
sns.scatterplot(x='X1', y='X2', hue='Y', data=df);
plt.title('Ground Truth')

[Figure: "Ground Truth" scatter plot of X1 vs. X2, colored by Y]

[Robust Random Cut Forest Parameter]

  • Introduction page : https://klabum.github.io/rrcf/
  • Code GitHub : https://github.com/kLabUM/rrcf
  • Unusually, rrcf has no ready-made parameter interface, so the forest setup has to be written in code by hand…
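The forest-building code that follows draws its subsamples with a single `np.random.choice` call. As a minimal sketch of what that call returns (assuming the data has 1,000 rows and the same `tree_size` of 500; `n_points`, `trees_per_draw`, and `draws_needed` are illustrative names, not part of rrcf), each draw yields `n_points // tree_size` disjoint index sets, one per tree:

```python
import numpy as np

np.random.seed(2023)  # seed only so the sketch is reproducible

n_points = 1000   # assumed number of rows in the data
tree_size = 500   # points per tree, as in the post
num_trees = 150   # trees in the forest, as in the post

# One draw: an array of row indices with shape (n_points // tree_size, tree_size).
# replace=False applies to the whole array, so no index is repeated across rows.
ixs = np.random.choice(n_points, size=(n_points // tree_size, tree_size),
                       replace=False)

trees_per_draw = ixs.shape[0]
draws_needed = -(-num_trees // trees_per_draw)  # ceiling division

print(ixs.shape)            # (2, 500)
print(np.unique(ixs).size)  # 1000
print(draws_needed)         # 75
```

Under these assumptions every pass through the `while` loop adds two trees that together cover every row once, so each point ends up scored by the same number of trees.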
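# Convert to a NumPy array. Note that this also includes the Y column;
# only the first two columns (X1, X2) are passed to the trees below.
df_array = np.array(df)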

# Set parameters
num_trees = 150
tree_size = 500

# Construct forest
forest = []
while len(forest) < num_trees:
    # Select random subsets of points uniformly from point set
    ixs = np.random.choice(df_array.shape[0], size=(df_array.shape[0] // tree_size, tree_size),
                           replace=False)
    # Add sampled trees to forest
    trees = [rrcf.RCTree(df_array[ix, :2], index_labels=ix) for ix in ixs]
    forest.extend(trees)

# Compute average CoDisp
avg_codisp = pd.Series(0.0, index=np.arange(df_array.shape[0]))
index = np.zeros(df_array.shape[0])
for tree in forest:
    codisp = pd.Series({leaf : tree.codisp(leaf) for leaf in tree.leaves})
    avg_codisp[codisp.index] += codisp 
    np.add.at(index, codisp.index.values, 1)
avg_codisp /= index
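  • while len(forest) < num_trees: you have to build the trees yourself
  • The scores are never negative, so there is no need to convert them to positive values
  • There is no separate predict output, i.e. no predicted labels, only scores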
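Because there is no predicted label, one has to be derived from the scores. A minimal sketch, assuming a percentile cut like the one used in the plots; `label_by_percentile` is a hypothetical helper, not part of rrcf, and the toy scores are made up for illustration:

```python
import numpy as np

def label_by_percentile(scores, percentile):
    """Return 1 for scores at or above the given percentile of the scores, else 0."""
    scores = np.asarray(scores, dtype=float)
    threshold = np.percentile(scores, percentile)
    return (scores >= threshold).astype(int)

# Toy scores (made up): two points clearly stand out from the rest.
toy_scores = np.array([1.0, 1.2, 0.9, 1.1, 8.0, 1.0, 0.8, 1.3, 9.5, 1.1])
labels = label_by_percentile(toy_scores, 80)
print(labels)  # [0 0 0 0 1 0 0 0 1 0]
```

With the post's variables this could be called as `label_by_percentile(avg_codisp, 90)`; a cut at 90 flags roughly the top 10% of scores, which lines up with the `contamination=0.1` used to generate the data.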
radius = avg_codisp
for p in range(80, 99):
    # Points at or above the p-th percentile of the score count as abnormal
    threshold = np.percentile(radius, p)
    is_abnormal = (radius >= threshold).to_numpy()

    plt.figure(figsize=(13, 8))
    plt.title("Robust Random Cut Forest\n Cut Percentile : {}".format(p))
    plt.scatter(df.iloc[:, 0], df.iloc[:, 1], color="k", s=3.0, label="Data points")

    # Plot circles with size proportional to the outlier scores:
    # red for abnormal points, blue for the rest
    for mask, color in ((is_abnormal, "r"), (~is_abnormal, "b")):
        plt.scatter(
            df.iloc[mask, 0],
            df.iloc[mask, 1],
            s=10 * radius[mask],
            edgecolors=color,
            facecolors="none",
        )

    plt.axis("tight")
    plt.xlim((-10, 10))
    plt.ylim((-10, 10))
    plt.legend(loc="upper left")
    plt.show()

[Figures: Robust Random Cut Forest scatter plots at different cut percentiles]

  • abnormal -> red
  • The cut percentile was raised from 80 up to 98
  • By 98, the result looks similar to LOF
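The comparison above is visual. Since the generated data comes with ground-truth labels `Y`, the effect of the cut could also be checked numerically. A rough sketch, assuming binary labels where 1 marks an outlier; `precision_recall_at_cut` is a hypothetical helper and the arrays are toy values, not results from this post:

```python
import numpy as np

def precision_recall_at_cut(scores, y_true, percentile):
    """Flag scores at or above the percentile cut and compare with the true labels."""
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true).astype(bool)
    flagged = scores >= np.percentile(scores, percentile)
    true_positives = np.sum(flagged & y_true)
    # max(..., 1) avoids division by zero when nothing is flagged or labeled
    precision = true_positives / max(flagged.sum(), 1)
    recall = true_positives / max(y_true.sum(), 1)
    return float(precision), float(recall)

# Toy values (made up): two true outliers and one borderline normal point.
toy_scores = np.array([1.0, 1.2, 0.9, 1.1, 8.0, 1.0, 0.8, 1.3, 9.5, 2.0])
toy_truth = np.array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0])

for cut in (70, 80, 90):
    precision, recall = precision_recall_at_cut(toy_scores, toy_truth, cut)
    print(cut, round(precision, 2), round(recall, 2))
```

On the toy data a low cut flags the borderline point as well (lower precision), while a high cut misses one of the true outliers (lower recall). The same trade-off would apply when calling it with `avg_codisp` and `df['Y']`.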
