22.LOF Code

2024-03-27 3 분 소요

[목적]

Local Outlier Factor Code 실습
Multivariate variable (다변량)일 때 사용
각 Data마다 Score를 계산하여 Abnormal을 산출 할 수 있음

[Process]

Define Data
Modeling
Plotting

!pip install pyod

import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

from pyod.utils.data import generate_data, get_outliers_inliers
from sklearn.neighbors import LocalOutlierFactor

# Data Loading
X, Y = generate_data(behaviour='new', n_features=10, 
                     train_only=True,
                     contamination=0.1,
                     random_state=2023)

# Naming for columns
col_list = []
for i in range(X.shape[1]):
    a = 'X{}'.format(i+1)
    col_list.append(a)

# Make DF
df = pd.DataFrame(X, columns = col_list)
df['Y'] = Y

# Data 분포 확인하기 X1, X2
sns.scatterplot(x='X1', y='X2', hue='Y', data=df);
plt.title('Ground Truth')

[Local Outlier Factor Parameter]

package : https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html
n_neighbors : 밀도를 계산하고자 하는 주변 관측치 개수
algorithm : [‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’], default=’auto’
- 가까운 이웃을 계산하는데 쓰이는 알고리즘
leaf_size : Tree algorithm에서 leaf node의 개수, default=30
- ‘ball_tree’, ‘KD_Tree’를 사용할때 사용되는 parameter
metric : [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘minkowski’], default=‘minkowski’
p : minkowski를 사용할때 사용되는 param
- When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2.
- 따라서, Metric 쪽을 그냥 가만히 냅두면 Just euclidean_distance
```
# LOF Setup
LOF = LocalOutlierFactor(n_neighbors=10, contamination=0.1)
y_pred = LOF.fit_predict(df[['X1', 'X2']])
```
원래 데이터에선 정상이 0이고 비정상이 1이었음
근데 LOF에서는 정상을 1, 비정상을 -1로 예측함

# LOF Setup
LOF = LocalOutlierFactor(n_neighbors=10, contamination=0.1)
y_pred = LOF.fit_predict(df[['X1', 'X2']])

for i in range(y_pred.shape[0]):
    if y_pred[i] == 1:
        y_pred[i] = 0
    else:
        y_pred[i] = 1

n_errors = (y_pred != df['Y']).sum()
X_scores = LOF.negative_outlier_factor_

그래서 바꿔줌, 1이면 0으로, -1이면 1로
negative_outlier_factor_ : 0 에 가까우면 정상, 더 작은 음수가 될 수록(절댓값이 큰) 이상치일 가능성이 높음, 근데 얘 땜에 X_scores값이 다 음수로 바뀜 그래서 scaling 해줘야 해

n_errors

error 172개임 (좀 많네..?)

radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())

x_scores를 radius로 바꿔줌 그러면 scale되고 다시 양수로 돌아옴

plt.figure(figsize=(13,8))
plt.title("Local Outlier Factor (LOF)")
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], color="k", s=3.0, label="Data points")
# plot circles with radius proportional to the outlier scores
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min()) # MinMax Scale
plt.scatter(
    df.iloc[:, 0],
    df.iloc[:, 1],
    s=1000 * radius,
    edgecolors="r",
    facecolors="none",
    label="Outlier scores",
)
plt.axis("tight")
plt.xlim((-10, 10))
plt.ylim((-10, 10))
plt.xlabel("prediction errors: %d" % (n_errors))
legend = plt.legend(loc="upper left")
legend.legendHandles[0]._sizes = [10]
legend.legendHandles[1]._sizes = [20]
plt.show()

s=1000 -> 원의 크기 조절
원의 크기가 작은게 정상, 보니까 LOF는 아까 주황색 원들의 값들도 정상이라고 판단, 오히려 우측 상단 값에서 조금 떨어진 변방의 값들을 abnormal로 판단

plt.figure(figsize=(13,8))
plt.title("Local Outlier Factor (LOF)")
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], color="k", s=3.0, label="Data points")
# plot circles with radius proportional to the outlier scores
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min()) # MinMax Scale

for i in range(df.shape[0]):
    if radius[i] >= np.percentile(radius, 95):
            plt.scatter(
            df.iloc[i, 0],
            df.iloc[i, 1],
            s=1000 * radius[i],
            edgecolors="r",
            facecolors="none",
            #label="Outlier scores",
        )
    elif radius[i] < np.percentile(radius, 95):
            plt.scatter(
            df.iloc[i, 0],
            df.iloc[i, 1],
            s=1000 * radius[i],
            edgecolors="b",
            facecolors="none",
            #label="Outlier scores",
        )

plt.axis("tight")
plt.xlim((-10, 10))
plt.ylim((-10, 10))
plt.xlabel("prediction errors: %d" % (n_errors))
legend = plt.legend(loc="upper left")
#legend.legendHandles[0]._sizes = [10]
#legend.legendHandles[1]._sizes = [20]
plt.show()

색깔로 표시해본거
hyperparameter tuning: 99% 정상-> 파란색, 1% 비정상-> 빨간색

plt.figure(figsize=(13,8))
plt.title("Local Outlier Factor (LOF)")
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], color="k", s=3.0, label="Data points")
# plot circles with radius proportional to the outlier scores
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min()) # MinMax Scale

for i in range(df.shape[0]):
    if y_pred[i] == 1:
            plt.scatter(
            df.iloc[i, 0],
            df.iloc[i, 1],
            s=1000 * radius[i],
            edgecolors="r",
            facecolors="none",
            #label="Outlier scores",
        )
    elif y_pred[i] == 0:
            plt.scatter(
            df.iloc[i, 0],
            df.iloc[i, 1],
            s=1000 * radius[i],
            edgecolors="b",
            facecolors="none",
            #label="Outlier scores",
        )

plt.axis("tight")
plt.xlim((-10, 10))
plt.ylim((-10, 10))
plt.xlabel("prediction errors: %d" % (n_errors))
legend = plt.legend(loc="upper left")
#legend.legendHandles[0]._sizes = [10]
#legend.legendHandles[1]._sizes = [20]
plt.show()

이건 알고리즘에서 predict 했던 이상값이면 빨간색으로 표시

Twitter Facebook LinkedIn

Lee SeungHyun

22.LOF Code

공유하기

댓글남기기

참고

코드 유사성 판단 경진 대회- private 3위

25.One-class SVM Code

24.Robust Random Cut Forest Code

23.Isolation Forest Code