
[Purpose]

  • Practice Spectral Clustering, one of the clustering methods in unsupervised learning
  • Use a for loop to run the clustering while varying K over a range
  • Clustering fundamentally takes a very long time when the data is large (see the sketch below)
    • Many of these algorithms build a distance matrix and then do heavy matrix computations on it
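
As a rough illustration of that cost, here is a minimal sketch on synthetic data (the row counts and feature count are arbitrary assumptions) showing how the pairwise distance matrix grows quadratically with the number of rows:

# Sketch: an n x n pairwise distance matrix grows quadratically with n.
# (Synthetic data; the sizes below are arbitrary illustrative choices.)
import numpy as np
from sklearn.metrics import pairwise_distances

for n in [1000, 2000, 4000]:
    X_demo = np.random.rand(n, 20)    # n rows, 20 features
    D = pairwise_distances(X_demo)    # O(n^2) time and memory
    print("n={:>5} -> matrix {}, {:.0f} MB".format(n, D.shape, D.nbytes / 1e6))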

[Process]

  1. Define X’s
  2. Modeling

import gc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

from sklearn.cluster import SpectralClustering
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
  • MinMaxScaler -> so this is what does the scaling
  • PCA -> dimensionality reduction -> needed for visualization?

# Data Loading (mortality-after-surgery data)
data = pd.read_csv("https://raw.githubusercontent.com/GonieAhn/Data-Science-online-course-from-gonie/main/Data%20Store/example_data.csv")

# Data Checking
col = []
missing = []
level = [] 
for name in data.columns:
    
    # Missing
    missper = data[name].isnull().sum() / data.shape[0]
    missing.append(round(missper, 4))

    # Leveling
    lel = data[name].dropna()
    level.append(len(list(set(lel))))

    # Columns
    col.append(name)

summary = pd.concat([pd.DataFrame(col, columns=['name']), 
                     pd.DataFrame(missing, columns=['Missing Percentage']), 
                     pd.DataFrame(level, columns=['Level'])], axis=1)

# Drop columns that are constant or mostly missing
drop_col = summary['name'][(summary['Level'] <= 1) | (summary['Missing Percentage'] >= 0.8)]
data.drop(columns=drop_col, inplace=True)
print(">>>> Data Shape : {}".format(data.shape))

# X's & Y Split
Y = data['censor']
X = data.drop(columns=['censor'])

[Scaling Before Clustering]

  • Clustering requires computing distances between points
  • If the features are on different scales, the distance calculation is implicitly weighted toward the large-scale features
  • Therefore, scaling is essential for any distance-based clustering (a small demonstration follows the code below)

# Scaling
scaler = MinMaxScaler().fit(X)
X_scal = scaler.transform(X)
X_scal = pd.DataFrame(X_scal, columns=X.columns)
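
To make the scale effect concrete, here is a small sketch with toy two-feature data (the values are chosen purely for illustration) showing how one large-scale feature dominates the Euclidean distance until MinMax scaling is applied:

# Sketch: a large-scale feature dominates Euclidean distance.
# (Toy values chosen purely for illustration.)
import numpy as np
from sklearn.preprocessing import MinMaxScaler

A = np.array([[1000.0, 0.1],    # feature 1 spans thousands, feature 2 spans ~1
              [9000.0, 0.9],
              [1100.0, 0.9]])

# Raw distances: feature 1 decides everything (~8000 vs ~100)
print(np.linalg.norm(A[0] - A[1]), np.linalg.norm(A[0] - A[2]))

# After MinMax scaling, both features contribute (~1.41 vs ~1.00)
A_scal = MinMaxScaler().fit_transform(A)
print(np.linalg.norm(A_scal[0] - A_scal[1]), np.linalg.norm(A_scal[0] - A_scal[2]))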

[Dimensionality Reduction]

  • Reduce the data to two dimensions so the clustering result can be visualized
  • Dimensionality reduction itself is covered in detail in the next chapter

# PCA is used here only to embed the data in 2-D for plotting
pca = PCA(n_components=2)
X_PCA = pca.fit_transform(X)
X_EMM = pd.DataFrame(X_PCA, columns=['AXIS1','AXIS2'])
print(">>>> PCA Variance : {}".format(pca.explained_variance_ratio_))
  • 21 dimensions -> 2: there is probably some information loss here? (the sketch below checks how much)
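
To quantify that loss, one option is to refit PCA with all components and inspect the cumulative explained variance; a minimal sketch, assuming the X defined above (this check is not part of the original run):

# Sketch: how much variance survives the 2-D projection?
# (Assumes X from the cells above; refits PCA with all components.)
import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA().fit(X)
cum = np.cumsum(pca_full.explained_variance_ratio_)
print("Cumulative variance of first 5 components : {}".format(np.round(cum[:5], 3)))
print("Components needed for 95% variance : {}".format(int(np.argmax(cum >= 0.95)) + 1))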

[Spectral Clustering]

  • Hyperparameter Tuning using for Loop

[Spectral Clustering Parameters]

  • Package : https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html
  • n_clusters : the number of clusters (K)
  • affinity : how the affinity (similarity) matrix is constructed
    • ‘nearest_neighbors’: construct the affinity matrix by computing a graph of nearest neighbors.
    • ‘rbf’: construct the affinity matrix using a radial basis function (RBF) kernel.
      • Gaussian kernel (the default)
    • ‘precomputed’: interpret X as a precomputed affinity matrix, where larger values indicate greater similarity between instances.
    • ‘precomputed_nearest_neighbors’: interpret X as a sparse graph of precomputed distances, and construct a binary affinity matrix from the n_neighbors nearest neighbors of each instance.
  • n_neighbors : how many neighboring points to consider when computing the similarity (see the sketch after this list)
    • Number of neighbors to use when constructing the affinity matrix using the nearest neighbors method. Ignored for affinity=’rbf’ (so it is not needed with the Gaussian kernel!)
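
For a quick feel for the two most common affinity settings, here is a minimal sketch on synthetic blob data (the blob parameters and n_neighbors=10 are arbitrary illustrative choices):

# Sketch: 'rbf' vs 'nearest_neighbors' affinity side by side.
# (Synthetic blobs; n_neighbors=10 is an arbitrary illustrative choice.)
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)

rbf_labels = SpectralClustering(n_clusters=3, affinity='rbf',
                                random_state=42).fit_predict(X_demo)
knn_labels = SpectralClustering(n_clusters=3, affinity='nearest_neighbors',
                                n_neighbors=10, random_state=42).fit_predict(X_demo)

print(rbf_labels[:10])
print(knn_labels[:10])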

Spectral Clustering

# Spectral Clustering Modeling
for cluster in list(range(2, 6)):
    Cluster = SpectralClustering(n_clusters=cluster).fit(X_scal)
    labels = Cluster.labels_

    # Add the cluster labels to the DataFrame
    data['{} label'.format(cluster)] = labels
    labels = pd.DataFrame(labels, columns=['labels'])
    # Plot Data Setting
    plot_data = pd.concat([X_EMM, labels], axis=1)
    groups = plot_data.groupby('labels')

    mar = ['o', '+', '*', 'D', ',', 'h', '1', '2', '3', '4', 's', '<', '>']
    colo = ['red', 'orange', 'green', 'blue', 'cyan', 'magenta', 'black', 'yellow', 'grey', 'orchid', 'lightpink']

    fig, ax = plt.subplots(figsize=(10,10))
    for j, (name, group) in enumerate(groups):
        ax.plot(group['AXIS1'],
                group['AXIS2'],
                marker=mar[j],
                linestyle='',
                label=name,
                c=colo[j],
                ms=4)
    ax.legend(fontsize=12, loc='upper right') # legend position
    plt.title('Scatter Plot', fontsize=20)
    plt.xlabel('AXIS1', fontsize=14)
    plt.ylabel('AXIS2', fontsize=14)
    plt.show()
    print("---------------------------------------------------------------------------------------------------")

    gc.collect()

[Figure: scatter plots of the 2-D PCA embedding colored by cluster label, for K = 2, 3, 4, 5]

  • K = 2 seems to give the cleanest separation (a numeric check follows below)
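
Instead of judging the plots by eye alone, the silhouette score gives a numeric criterion for choosing K; a minimal sketch assuming the X_scal defined above (silhouette analysis is not part of the original run):

# Sketch: silhouette score as a numeric aid for choosing K.
# (Assumes X_scal from the cells above; scores range over [-1, 1], higher is better.)
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

for k in range(2, 6):
    labels_k = SpectralClustering(n_clusters=k, random_state=42).fit_predict(X_scal)
    print("K={} -> silhouette = {:.3f}".format(k, silhouette_score(X_scal, labels_k)))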

# Check the confusion matrix
cm = confusion_matrix(data['censor'], data['2 label'])
print(cm)


  • …but the accuracy is way too low; this looks unusable
# ACC & F1-Score
print("Test Acc : {}".format((cm[0,0] + cm[1,1])/cm.sum()))
print("F1-Score : {}".format(f1_score(data['censor'], data['2 label'])))

  • ACC ≈ 0.45, F1 ≈ 0.29 -> the clustering does not seem to fit this data (but see the label-mapping caveat below)
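
One caveat before writing the result off: cluster IDs are assigned arbitrarily, so cluster 0 may simply correspond to class 1 and vice versa. A minimal sketch that scores both mappings, assuming the data and '2 label' column from above:

# Sketch: cluster IDs are arbitrary, so score both possible label mappings.
# (Assumes data['censor'] and data['2 label'] from the cells above.)
from sklearn.metrics import f1_score

pred = data['2 label']
for name, mapping in [('as-is', pred), ('flipped', 1 - pred)]:
    acc = (mapping == data['censor']).mean()
    print("{:>7} -> Acc : {:.3f}, F1 : {:.3f}".format(
        name, acc, f1_score(data['censor'], mapping)))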
