[Purpose]

  • Hands-on practice and interpretation of AdaBoost, an early Boosting model designed to lower bias

[Process]

  1. Define X’s & Y
  2. Split Train & Valid dataset
  3. Modeling
  4. Model interpretation
import os
import gc
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.ensemble import AdaBoostClassifier
from collections import Counter
  • Standard imports used in every analysis
  • AdaBoostClassifier is imported from sklearn.ensemble
# Data Loading (post-surgery mortality data)
data=pd.read_csv("https://raw.githubusercontent.com/GonieAhn/Data-Science-online-course-from-gonie/main/Data%20Store/example_data.csv")
  • Same dataset as before
# X's & Y Split
Y = data['censor']
X = data.drop(columns=['censor'])
  • Only the censor column is taken as Y; the remaining columns become X
idx = list(range(X.shape[0]))
train_idx, valid_idx = train_test_split(idx, test_size=0.3, random_state=2021)
print(">>>> # of Train data : {}".format(len(train_idx)))
print(">>>> # of valid data : {}".format(len(valid_idx)))
print(">>>> # of Train data Y : {}".format(Counter(Y.iloc[train_idx])))
print(">>>> # of valid data Y : {}".format(Counter(Y.iloc[valid_idx])))
  • The split is done on indices, so the same train/valid split applies to both X and Y
  • Counter : checks the class imbalance of the data (if it is imbalanced, see the stratified sketch below)
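If Counter reveals a class imbalance, one option is to stratify the split so train and valid keep the same class ratio. A minimal sketch using the variables defined above; the stratify argument is a standard train_test_split option but is not used in the original code:

# Sketch (assumption): stratify on Y so the class ratio stays identical in
# both sets. New variable names avoid clobbering the original split.
train_idx_s, valid_idx_s = train_test_split(idx, test_size=0.3, random_state=2021, stratify=Y)
print(">>>> Train Y : {}".format(Counter(Y.iloc[train_idx_s])))
print(">>>> Valid Y : {}".format(Counter(Y.iloc[valid_idx_s])))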

[AdaBoost Parameters]

  • Package : https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
  • n_estimators : # of trees (boosting rounds)
  • learning_rate : trades off against n_estimators (when one is high, the other should be low; see the sketch after this list)
  • Weight applied to each classifier at each boosting iteration
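The trade-off is easy to visualize with staged_predict, which replays the ensemble one boosting round at a time. A minimal sketch; the values n_estimators=100 and learning_rate=0.5 are illustrative assumptions, not tuned choices:

# Sketch: score the valid set after every boosting round to see how
# learning_rate interacts with the number of rounds.
demo = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=119)
demo.fit(X.iloc[train_idx], Y.iloc[train_idx])
stage_f1 = [f1_score(Y.iloc[valid_idx], y_hat) for y_hat in demo.staged_predict(X.iloc[valid_idx])]
plt.plot(range(1, len(stage_f1) + 1), stage_f1)
plt.xlabel("Boosting round")
plt.ylabel("Valid F1-Score")
plt.show()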
# AdaBoost Hyperparameter
estimators = [70, 90, 100]
learning = [0.01, 0.03, 0.05, 0.1, 0.5]

# Modeling
save_est = []
save_lr = []
f1_score_ = []

cnt = 0
for est in estimators:
    for lr in learning:
        print(">>> {} <<<".format(cnt))
        cnt += 1
        print("Number of Estimators : {}, Learning Rate : {}".format(est, lr))
        
        model = AdaBoostClassifier(n_estimators=est, learning_rate=lr, random_state=119)
        model.fit(X.iloc[train_idx], Y.iloc[train_idx])

        # Train Acc
        y_pre_train = model.predict(X.iloc[train_idx])
        cm_train = confusion_matrix(Y.iloc[train_idx], y_pre_train)
        print("Train Confusion Matrix")
        print(cm_train)
        print("Train Acc : {}".format((cm_train[0,0] + cm_train[1,1])/cm_train.sum()))
        print("Train F1-Score : {}".format(f1_score(Y.iloc[train_idx], y_pre_train)))

        # Test Acc
        y_pre_test = model.predict(X.iloc[valid_idx])
        cm_test = confusion_matrix(Y.iloc[valid_idx], y_pre_test)
        print("Test Confusion Matrix")
        print(cm_test)
        print("TesT Acc : {}".format((cm_test[0,0] + cm_test[1,1])/cm_test.sum()))
        print("Test F1-Score : {}".format(f1_score(Y.iloc[valid_idx], y_pre_test)))
        print("-----------------------------------------------------------------------")
        print("-----------------------------------------------------------------------")
        save_est.append(est)
        save_lr.append(lr)
        f1_score_.append(f1_score(Y.iloc[valid_idx], y_pre_test))
  • Train accuracy is lower than expected -> maybe Random Forest is the better fit here...?
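As an aside, the same sweep can be written with scikit-learn's GridSearchCV, which cross-validates each combination instead of scoring a single split. A minimal sketch; cv=5 and scoring="f1" are assumptions, not choices from the original:

# Sketch: cross-validated grid search over the same hyperparameter grid.
from sklearn.model_selection import GridSearchCV
param_grid = {"n_estimators": [70, 90, 100], "learning_rate": [0.01, 0.03, 0.05, 0.1, 0.5]}
search = GridSearchCV(AdaBoostClassifier(random_state=119), param_grid, scoring="f1", cv=5)
search.fit(X.iloc[train_idx], Y.iloc[train_idx])
print("Best params : {}".format(search.best_params_))
print("Best CV F1 : {}".format(search.best_score_))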
print(">>> {} <<<\nBest Test f1-score : {}\nBest n_estimators : {}\nBest Learning Rate : {}".format(np.argmax(f1_score_),
                                                                                                    f1_score_[np.argmax(f1_score_)], 
                                                                                                    save_est[np.argmax(f1_score_)],
                                                                                                    save_lr[np.argmax(f1_score_)]))
  • Finding the best model
  • Best score at the edge of the grid? -> raise n_estimators and search again

[Caution]

  • Right now the models are not being saved during parameter tuning
  • For algorithms that take a very long time to train, each model should be saved as part of the tuning loop (a pickle sketch follows)
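A minimal sketch of that checkpointing with pickle (already imported above); the file-name convention is a hypothetical example, not from the original:

# Sketch: persist each fitted candidate inside the tuning loop so a long
# run never has to be repeated. The file-name format is hypothetical.
def save_checkpoint(model, est, lr):
    with open("adaboost_est{}_lr{}.pkl".format(est, lr), "wb") as f:
        pickle.dump(model, f)

def load_checkpoint(est, lr):
    with open("adaboost_est{}_lr{}.pkl".format(est, lr), "rb") as f:
        return pickle.load(f)

# Inside the double loop above, save_checkpoint(model, est, lr) would go
# right after model.fit(...).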
best_model = AdaBoostClassifier(n_estimators=save_est[np.argmax(f1_score_)], learning_rate=save_lr[np.argmax(f1_score_)], random_state=119)
best_model.fit(X.iloc[train_idx], Y.iloc[train_idx])

# Train Acc
y_pre_train = best_model.predict(X.iloc[train_idx])
cm_train = confusion_matrix(Y.iloc[train_idx], y_pre_train)
print("Train Confusion Matrix")
print(cm_train)
print("Train Acc : {}".format((cm_train[0,0] + cm_train[1,1])/cm_train.sum()))
print("Train F1-Score : {}".format(f1_score(Y.iloc[train_idx], y_pre_train)))

# Test Acc
y_pre_test = best_model.predict(X.iloc[valid_idx])
cm_test = confusion_matrix(Y.iloc[valid_idx], y_pre_test)
print("Test Confusion Matrix")
print(cm_test)
print("TesT Acc : {}".format((cm_test[0,0] + cm_test[1,1])/cm_test.sum()))
print("Test F1-Score : {}".format(f1_score(Y.iloc[valid_idx], y_pre_test)))
  • This refits with the best hyperparameters from scratch; in practice the fitted models are often saved as they are trained instead!
feature_map = pd.DataFrame(sorted(zip(best_model.feature_importances_, X.columns), reverse=True), columns=['Score', 'Feature'])
print(feature_map)
# Importance Score Top 10
feature_map_10 = feature_map.iloc[:10]
plt.figure(figsize=(20, 10))
sns.barplot(x="Score", y="Feature", data=feature_map_10.sort_values(by="Score", ascending=False))
plt.title('AdaBoost Importance Features')
plt.tight_layout()
plt.show()

(Figure: horizontal bar plot of the top-10 AdaBoost feature importance scores)
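Impurity-based feature_importances_ can favor features the stumps happen to split on most often; permutation importance on the valid set is a common cross-check. A minimal sketch; scoring="f1" and n_repeats=10 are assumptions:

# Sketch: permutation importance measures how much the valid F1 drops when
# each feature is shuffled, giving a model-agnostic importance score.
from sklearn.inspection import permutation_importance
result = permutation_importance(best_model, X.iloc[valid_idx], Y.iloc[valid_idx], scoring="f1", n_repeats=10, random_state=119)
perm_map = pd.DataFrame(sorted(zip(result.importances_mean, X.columns), reverse=True), columns=['Score', 'Feature'])
print(perm_map.head(10))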
