15.Spectral Clustering Code
[목적]
- 비지도학습 중 하나인 Clustering 중 Spectral Clustering 실습
- ForLoop 활용 K의 Range를 변경 시켜가며 실습 진행
- Clustering은 기본적으로 Data가 많을 때 시간이 굉장히 오래걸림
- Distance Matrix를 만들고 행렬 계산을 하는 알고리즘이 많기 때문
[Process]
- Define X’s
- Modeling
import gc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
from sklearn.cluster import SpectralClustering
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
- minmax-> scale하나보네
-
PCA-> 차원 축소-> 시각화 필요?
# Data Loading (수술 時 사망 데이터) data=pd.read_csv("https://raw.githubusercontent.com/GonieAhn/Data-Science-online-course-from-gonie/main/Data%20Store/example_data.csv")
# Data Checking
col = []
missing = []
level = []
for name in data.columns:
# Missing
missper = data[name].isnull().sum() / data.shape[0]
missing.append(round(missper, 4))
# Leveling
lel = data[name].dropna()
level.append(len(list(set(lel))))
# Columns
col.append(name)
summary = pd.concat([pd.DataFrame(col, columns=['name']),
pd.DataFrame(missing, columns=['Missing Percentage']),
pd.DataFrame(level, columns=['Level'])], axis=1)
drop_col = summary['name'][(summary['Level'] <= 1) | (summary['Missing Percentage'] >= 0.8)]
data.drop(columns=drop_col, inplace=True)
print(">>>> Data Shape : {}".format(data.shape))
# X's & Y Split
Y = data['censor']
X = data.drop(columns=['censor'])
[Clustering 전 Scaling]
- Clustering은 Distance를 구하는 작업이 필요함
- Feature들의 Scale이 다르면 Distance를 구하는데 가중치가 들어가게 됨
- 따라서, Distance 기반의 Clustering의 경우 Scaling이 필수
# Scaling
scaler = MinMaxScaler().fit(X)
X_scal = scaler.transform(X)
X_scal = pd.DataFrame(X_scal, columns=X.columns)
[차원축소]
- Clustering의 결과를 확인하기 위하여 차원 축소 진행
- 다음 Chapter에서 차원축소에 대해 자세히 설명할 예정
pca = PCA(n_components=2).fit(X)
X_PCA = pca.fit_transform(X)
X_EMM = pd.DataFrame(X_PCA, columns=['AXIS1','AXIS2'])
print(">>>> PCA Variance : {}".format(pca.explained_variance_ratio_))
- 21차원 -> 2차원: 소실이 좀 있을 듯?
[Spectral Clustering]
- Hyperparameter Tuning using for Loop
[Spectral Clustering Parameters]
- Package : https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html
- n_clusters : Cluster 개수 (K)
- affinity : 유사도 행렬 만드는 방법
- ‘nearest_neighbors’: construct the affinity matrix by computing a graph of nearest neighbors.
- ‘rbf’: construct the affinity matrix using a radial basis function (RBF) kernel.
- Gaussian Kernel(디폴트)
- ‘precomputed’: interpret X as a precomputed affinity matrix, where larger values indicate greater similarity between instances.
- ‘precomputed_nearest_neighbors’: interpret X as a sparse graph of precomputed distances, and construct a binary affinity matrix from the n_neighbors nearest neighbors of each instance.
- n_neighbors : 유사도 계산시 주변 n개를 보고 판단할 것 인지
- Number of neighbors to use when constructing the affinity matrix using the nearest neighbors method. Ignored for affinity=’rbf’(Gaussian일 때는 필요 없다!)
Spectral Clustering
# Spectral Clustering Modeling
for cluster in list(range(2, 6)):
Cluster = SpectralClustering(n_clusters=cluster).fit(X_scal)
labels = Cluster.labels_
# label Add to DataFrame
data['{} label'.format(cluster)] = labels
labels = pd.DataFrame(labels, columns=['labels'])
# Plot Data Setting
plot_data = pd.concat([X_EMM, labels], axis=1)
groups = plot_data.groupby('labels')
mar = ['o', '+', '*', 'D', ',', 'h', '1', '2', '3', '4', 's', '<', '>']
colo = ['red', 'orange', 'green', 'blue', 'cyan', 'magenta', 'black', 'yellow', 'grey', 'orchid', 'lightpink']
fig, ax = plt.subplots(figsize=(10,10))
for j, (name, group) in enumerate(groups):
ax.plot(group['AXIS1'],
group['AXIS2'],
marker=mar[j],
linestyle='',
label=name,
c = colo[j],
ms=4)
ax.legend(fontsize=12, loc='upper right') # legend position
plt.title('Scatter Plot', fontsize=20)
plt.xlabel('AXIS1', fontsize=14)
plt.ylabel('AXIS2', fontsize=14)
plt.show()
print("---------------------------------------------------------------------------------------------------")
gc.collect()
- 2개일 때가 젤 잘 나뉜듯
# Confusion Matrix 확인
cm = confusion_matrix(data['censor'], data['2 label'])
print(cm)
- .. 근데 정확도가 넘 낮음 못 써먹을 듯
# ACC & F1-Score
print("TesT Acc : {}".format((cm[0,0] + cm[1,1])/cm.sum()))
print("F1-Score : {}".format(f1_score(data['censor'], data['2 label'])))
- ACC: 0.45~, F1: 0.29~ -> 데이터랑 분석이 안 맞는듯
댓글남기기