8. AdaBoost Code
[Purpose]
- Practice with and interpret AdaBoost, an early Boosting model aimed at lowering bias
[Process]
- Define X’s & Y
- Split Train & Valid dataset
- Modeling
- Model interpretation
import os
import gc
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.ensemble import AdaBoostClassifier
from collections import Counter
- The usual default imports
- AdaBoostClassifier comes from sklearn.ensemble
# Data Loading (surgical mortality data)
data=pd.read_csv("https://raw.githubusercontent.com/GonieAhn/Data-Science-online-course-from-gonie/main/Data%20Store/example_data.csv")
- Same dataset as before; a quick sanity check is sketched below
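A minimal inspection sketch to confirm the load went through (the columns beyond 'censor' are whatever the CSV provides):
# Quick sanity check of the loaded frame
print(data.shape)                      # (rows, columns)
print(data['censor'].value_counts())   # raw target distribution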
# X's & Y Split
Y = data['censor']
X = data.drop(columns=['censor'])
- Only the 'censor' column is taken as Y; the remaining columns become X
idx = list(range(X.shape[0]))
train_idx, valid_idx = train_test_split(idx, test_size=0.3, random_state=2021)
print(">>>> # of Train data : {}".format(len(train_idx)))
print(">>>> # of valid data : {}".format(len(valid_idx)))
print(">>>> # of Train data Y : {}".format(Counter(Y.iloc[train_idx])))
print(">>>> # of valid data Y : {}".format(Counter(Y.iloc[valid_idx])))
- Row indices are assigned for the train/valid split
- Counter checks the class imbalance of the data; a stratified alternative is sketched below
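If the Counter output shows a severe skew, a stratified split is one option; a minimal sketch, not part of the original pipeline:
# Optional: stratify keeps the class ratio identical in both splits
train_idx_s, valid_idx_s = train_test_split(idx, test_size=0.3, random_state=2021, stratify=Y)
print(Counter(Y.iloc[train_idx_s]), Counter(Y.iloc[valid_idx_s]))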
[AdaBoost Parameters]
- Package : https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
- n_estimators : # of Tree
- learning_rate : trades off against n_estimators (when one is high, the other should be low); the staged_predict sketch after this list shows the interaction
- Weight applied to each classifier at each boosting iteration
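One way to watch that trade-off directly is scikit-learn's staged_predict, which yields predictions after every boosting round; a minimal sketch using the split defined above (the probe settings are arbitrary, not from the original notebook):
# Sketch: validation F1 after each boosting round of a single probe model
probe = AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=119)
probe.fit(X.iloc[train_idx], Y.iloc[train_idx])
stage_f1 = [f1_score(Y.iloc[valid_idx], pred)
            for pred in probe.staged_predict(X.iloc[valid_idx])]
print("Best round : {}, F1 : {:.4f}".format(np.argmax(stage_f1) + 1, max(stage_f1)))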
# AdaBoost Hyperparameter
estimators = [70, 90, 100]
learning = [0.01, 0.03, 0.05, 0.1, 0.5]
# Modeling
save_est = []
save_lr = []
f1_score_ = []
cnt = 0
for est in estimators:
    for lr in learning:
        print(">>> {} <<<".format(cnt))
        cnt += 1
        print("Number of Estimators : {}, Learning Rate : {}".format(est, lr))
        model = AdaBoostClassifier(n_estimators=est, learning_rate=lr, random_state=119)
        model.fit(X.iloc[train_idx], Y.iloc[train_idx])
        # Train Acc
        y_pre_train = model.predict(X.iloc[train_idx])
        cm_train = confusion_matrix(Y.iloc[train_idx], y_pre_train)
        print("Train Confusion Matrix")
        print(cm_train)
        print("Train Acc : {}".format((cm_train[0,0] + cm_train[1,1])/cm_train.sum()))
        print("Train F1-Score : {}".format(f1_score(Y.iloc[train_idx], y_pre_train)))
        # Test Acc
        y_pre_test = model.predict(X.iloc[valid_idx])
        cm_test = confusion_matrix(Y.iloc[valid_idx], y_pre_test)
        print("Test Confusion Matrix")
        print(cm_test)
        print("Test Acc : {}".format((cm_test[0,0] + cm_test[1,1])/cm_test.sum()))
        print("Test F1-Score : {}".format(f1_score(Y.iloc[valid_idx], y_pre_test)))
        print("-----------------------------------------------------------------------")
        print("-----------------------------------------------------------------------")
        save_est.append(est)
        save_lr.append(lr)
        f1_score_.append(f1_score(Y.iloc[valid_idx], y_pre_test))
- Train accuracy is unexpectedly low -> Random Forest may be the better fit here...? (see the baseline sketch below)
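For the curious, a hedged Random Forest baseline on the identical split (n_estimators=100 is an arbitrary, untuned choice):
from sklearn.ensemble import RandomForestClassifier

# Sketch: Random Forest baseline on the same train/valid split
rf = RandomForestClassifier(n_estimators=100, random_state=119)
rf.fit(X.iloc[train_idx], Y.iloc[train_idx])
print("RF Test F1-Score : {}".format(f1_score(Y.iloc[valid_idx], rf.predict(X.iloc[valid_idx]))))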
print(">>> {} <<<\nBest Test f1-score : {}\nBest n_estimators : {}\nBest Learning Rate : {}".format(np.argmax(f1_score_),
f1_score_[np.argmax(f1_score_)],
save_est[np.argmax(f1_score_)],
save_lr[np.argmax(f1_score_)]))
- Finding the best model
- Best result at the tail end of the grid? -> bump n_estimators up and search again
[Caution]
- At the moment, no model is saved while parameter tuning runs
- For algorithms whose training takes very long, each model should be saved during parameter tuning (a pickle-based sketch follows)
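A minimal sketch of persisting each candidate inside the tuning loop with the pickle module imported above (the file-name pattern is an assumption):
# Sketch: save every fitted candidate right after model.fit(...) in the loop
with open("adaboost_est{}_lr{}.pkl".format(est, lr), "wb") as f:
    pickle.dump(model, f)

# Later, reload the winner instead of refitting (hypothetical file name)
with open("adaboost_est100_lr0.1.pkl", "rb") as f:
    loaded_model = pickle.load(f)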
best_model = AdaBoostClassifier(n_estimators=save_est[best_idx], learning_rate=save_lr[best_idx], random_state=119)
best_model.fit(X.iloc[train_idx], Y.iloc[train_idx])
# Train Acc
y_pre_train = best_model.predict(X.iloc[train_idx])
cm_train = confusion_matrix(Y.iloc[train_idx], y_pre_train)
print("Train Confusion Matrix")
print(cm_train)
print("Train Acc : {}".format((cm_train[0,0] + cm_train[1,1])/cm_train.sum()))
print("Train F1-Score : {}".format(f1_score(Y.iloc[train_idx], y_pre_train)))
# Test Acc
y_pre_test = best_model.predict(X.iloc[valid_idx])
cm_test = confusion_matrix(Y.iloc[valid_idx], y_pre_test)
print("Test Confusion Matrix")
print(cm_test)
print("TesT Acc : {}".format((cm_test[0,0] + cm_test[1,1])/cm_test.sum()))
print("Test F1-Score : {}".format(f1_score(Y.iloc[valid_idx], y_pre_test)))
- This refits the best model from scratch; in practice, models are often saved on the spot during tuning instead!
feature_map = pd.DataFrame(sorted(zip(best_model.feature_importances_, X.columns), reverse=True), columns=['Score', 'Feature'])
print(feature_map)
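As a hedged aside, scikit-learn normalizes these impurity-based importances so the Score column sums to 1, which allows a quick sanity check:
# Sketch: AdaBoost importances are normalized, so they should sum to ~1.0
print(best_model.feature_importances_.sum())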
# Importance Score Top 10
feature_map_top10 = feature_map.iloc[:10]
plt.figure(figsize=(20, 10))
sns.barplot(x="Score", y="Feature", data=feature_map_top10.sort_values(by="Score", ascending=False))
plt.title('AdaBoost Importance Features')
plt.tight_layout()
plt.show()