
[Purpose]

  • Hands-on practice with Isolation Forest code
  • Useful when the data is multivariate
  • Computes a score for each data point, from which abnormal points can be identified

[Process]

  1. Define Data
  2. Modeling
  3. Plotting
!pip install pyod
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

from pyod.utils.data import generate_data, get_outliers_inliers
from sklearn.ensemble import IsolationForest
  • IsolationForest is itself a kind of ensemble method, which is why it lives in sklearn.ensemble; a sketch of the score behind it follows
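For context, the score the model computes comes from the original paper (Liu et al., 2008): each random tree keeps splitting until a point is isolated, anomalies tend to be isolated in fewer splits, and the expected path length E[h(x)] is normalized into s(x, n) = 2^(-E[h(x)] / c(n)). A minimal sketch of that normalization, assuming the paper's formula (not part of the post's code):

def c_factor(n):
    # c(n): average path length of an unsuccessful BST search over n points
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + 0.5772156649  # H(i) ≈ ln(i) + Euler-Mascheroni constant
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    # s near 1 → likely anomaly; s well below 0.5 → normal; s ≈ 0.5 → indistinct
    return 2.0 ** (-avg_path_length / c_factor(n))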
# Data Loading
X, Y = generate_data(behaviour='new', n_features=10, 
                     train_only=True,
                     contamination=0.1,
                     random_state=2023)
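A quick sanity check on what generate_data returned; the expected shapes below assume pyod's default n_train=1000:

print(X.shape, Y.shape)             # expected: (1000, 10) (1000,)
print('n outliers:', int(Y.sum()))  # contamination=0.1 → about 100 points labeled 1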
# Naming for columns
col_list = ['X{}'.format(i + 1) for i in range(X.shape[1])]
# Make DF
df = pd.DataFrame(X, columns = col_list)
df['Y'] = Y
# Check the data distribution of X1 and X2
sns.scatterplot(x='X1', y='X2', hue='Y', data=df);
plt.title('Ground Truth')

[Figure: 'Ground Truth' scatter plot of X1 vs X2, colored by Y]

[Isolation Forest Parameter]

  • package : https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
  • n_estimators : number of base estimators in the ensemble, default=100
  • max_samples : number of samples drawn to train each estimator (int or float)
    • If int, then draw max_samples samples.
    • If float, then draw max_samples * X.shape[0] samples.
    • If "auto", then max_samples=min(256, n_samples).
    • default='auto'
  • contamination : expected proportion of outliers in the data set ('auto' or float); see the sketch after this list for how it sets the decision threshold
    • default='auto'
  • max_features : maximum number of features drawn per estimator (int or float)
    • If int, then draw max_features features.
    • If float, then draw max(1, int(max_features * n_features_in_)) features.
    • default=1.0, which uses every feature; that is usually fine for Isolation Forest
  • bootstrap : whether to draw samples with replacement (boolean)
    • default=False
    • not recommended here, since with replacement some outliers may never be drawn into a tree's sample
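How contamination becomes a decision rule: in sklearn, predict is equivalent to thresholding score_samples at the fitted offset_ attribute (decision_function = score_samples - offset_). A minimal sketch; IF_demo is an illustrative name:

IF_demo = IsolationForest(n_estimators=100, contamination=0.1, random_state=2023)
IF_demo.fit(df[['X1', 'X2']])
scores = IF_demo.score_samples(df[['X1', 'X2']])
labels = IF_demo.predict(df[['X1', 'X2']])   # +1 = inlier, -1 = outlier
print(IF_demo.offset_)                       # threshold on score_samples
print(((scores < IF_demo.offset_) == (labels == -1)).all())  # should print True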

Forest Setup

# Isolation Forest Setup
IF = IsolationForest(n_estimators=150, max_samples=500, contamination=0.1)
IF.fit(df[['X1', 'X2']])
y_pred = IF.predict(df[['X1', 'X2']])

# IsolationForest.predict returns +1 for inliers and -1 for outliers;
# remap to 0 (normal) / 1 (abnormal) to match the labels in Y
y_pred = np.where(y_pred == 1, 0, 1)

n_errors = (y_pred != df['Y']).sum()
  • 150 trees; 500 of the 1,000 data points sampled per tree; contamination 0.1
n_errors
  • Only 12 of the predictions are wrong
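Beyond the raw error count, sklearn's standard classification metrics apply directly, since y_pred and Y are both 0/1 (a sketch, not in the original post):

from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(df['Y'], y_pred))
print(classification_report(df['Y'], y_pred, digits=3))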
# Opposite of the anomaly score defined in the original paper.
X_scores = IF.score_samples(df[['X1', 'X2']])
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())  # min-max scale to [0, 1]; larger radius = more abnormal
plt.figure(figsize=(13,8))
plt.title("Isolation Forest")
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], color="k", s=3.0, label="Data points")
# plot circles with radius proportional to the outlier scores
plt.scatter(
    df.iloc[:, 0],
    df.iloc[:, 1],
    s=10000 * radius**2,
    edgecolors="r",
    facecolors="none",
    label="Outlier scores",
)
plt.axis("tight")
plt.xlim((-10, 10))
plt.ylim((-10, 10))
plt.xlabel("prediction errors: %d" % (n_errors))
legend = plt.legend(loc="upper left")
legend.legendHandles[0]._sizes = [10]
legend.legendHandles[1]._sizes = [20]
plt.show()

[Figure: 'Isolation Forest' scatter with red circles whose radius is proportional to the outlier score]

  • Here the circle area is scaled with s=10000 * radius**2, but with raw scores alone the normal/abnormal call is somewhat ambiguous
plt.figure(figsize=(13,8))
plt.title("Isolation Forest Code")
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], color="k", s=3.0, label="Data points")
# plot circles with radius proportional to the outlier scores
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())  # MinMax Scale

threshold = np.percentile(radius, 95)  # top 5% of scaled scores flagged as abnormal
for i in range(df.shape[0]):
    color = "r" if radius[i] >= threshold else "b"  # red = abnormal, blue = normal
    plt.scatter(
        df.iloc[i, 0],
        df.iloc[i, 1],
        s=1000 * radius[i],
        edgecolors=color,
        facecolors="none",
    )

plt.axis("tight")
plt.xlim((-10, 10))
plt.ylim((-10, 10))
plt.xlabel("prediction errors: %d" % (n_errors))
legend = plt.legend(loc="upper left")
plt.show()

[Figure: percentile-thresholded plot; red circles mark the top 5% of scores]

  • Likewise, the bottom 95% of scores are treated as normal and the top 5% as abnormal
plt.figure(figsize=(13,8))
plt.title("Isolation Forest")
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], color="k", s=3.0, label="Data points")
# plot circles with radius proportional to the outlier scores
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())  # MinMax Scale

for i in range(df.shape[0]):
    if y_pred[i] == 1:  # predicted abnormal: larger red circle
        plt.scatter(
            df.iloc[i, 0],
            df.iloc[i, 1],
            s=10000 * radius[i]**2,
            edgecolors="r",
            facecolors="none",
        )
    else:  # predicted normal: smaller blue circle
        plt.scatter(
            df.iloc[i, 0],
            df.iloc[i, 1],
            s=1000 * radius[i]**2,
            edgecolors="b",
            facecolors="none",
        )

plt.axis("tight")
plt.xlim((-10, 10))
plt.ylim((-10, 10))
plt.xlabel("prediction errors: %d" % (n_errors))
legend = plt.legend(loc="upper left")
plt.show()

[Figure: circles colored by the model's predict output; red = predicted abnormal, blue = predicted normal]

  • The predict output takes a completely global view of the data; this is the key difference from LOF, which scores points locally (see the sketch below)
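To make that contrast concrete, here is a sketch using sklearn's LocalOutlierFactor on the same two features; LOF judges each point against the density of its local neighborhood rather than the dataset as a whole (n_neighbors=20 is an illustrative choice, not from the post):

from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
lof_pred = lof.fit_predict(df[['X1', 'X2']])  # +1 = inlier, -1 = outlier
lof_errors = ((lof_pred == -1).astype(int) != df['Y']).sum()
print('LOF prediction errors:', lof_errors)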
from sklearn.inspection import DecisionBoundaryDisplay

disp = DecisionBoundaryDisplay.from_estimator(
    IF,
    df[['X1','X2']],
    response_method="predict",
    alpha=0.5,
)
disp.ax_.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df.iloc[:, -1], s=20, edgecolor="k")
disp.ax_.set_title("Binary decision boundary \nof IsolationForest")
plt.axis("square")
plt.show()

[Figure: binary decision boundary of IsolationForest with data points overlaid]
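The same display can also be drawn from the continuous anomaly score rather than the hard labels, which shows how confidence decays away from the data. A sketch assuming DecisionBoundaryDisplay's response_method="decision_function" and its surface_ attribute:

disp2 = DecisionBoundaryDisplay.from_estimator(
    IF,
    df[['X1', 'X2']],
    response_method="decision_function",
    alpha=0.5,
)
disp2.ax_.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df.iloc[:, -1], s=20, edgecolor="k")
disp2.ax_.set_title("Path-length decision boundary \nof IsolationForest")
plt.axis("square")
plt.colorbar(disp2.surface_)  # lower score = more abnormal
plt.show()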
