
[Purpose]

  • Practice and interpret the XGBoost model, which improves on the GBM model through innovative system design
  • XGBoost handles missing values inside the model itself, so they do not need to be deleted (see the sketch below)
  • It learns from big data quickly
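
A minimal sketch of the missing-value behavior, using toy data invented here for illustration: XGBoost accepts np.nan in the features directly and learns a default split direction for missing values, so no imputation or row deletion is needed on our side.

import numpy as np
import pandas as pd
from xgboost import XGBClassifier

# Toy data with NaN deliberately left in place
X_demo = pd.DataFrame({"x1": [1.0, np.nan, 3.0, 4.0, 5.0, np.nan],
                       "x2": [0.5, 2.0, np.nan, 1.0, 2.5, 3.0]})
y_demo = pd.Series([0, 1, 0, 1, 0, 1])

clf = XGBClassifier(n_estimators=5, max_depth=2)
clf.fit(X_demo, y_demo)    # trains without any NaN handling by the user
print(clf.predict(X_demo))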

[Process]

  1. Define X’s & Y
  2. Split Train & Valid dataset
  3. Modeling
  4. Model interpretation
import os
import gc
import re
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score
from xgboost import XGBClassifier, XGBRegressor
from collections import Counter
# Data Loading (post-surgery mortality data)
data=pd.read_csv("https://raw.githubusercontent.com/GonieAhn/Data-Science-online-course-from-gonie/main/Data%20Store/example_data.csv")

[Data Condition Check] The XGBoost package throws an error when a feature name contains special characters, so all column names must be preprocessed first.

# Feature Name Cleaning
regex = re.compile(r"\[|\]|<", re.IGNORECASE)
data.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in data.columns.values]
  • re replaces the special characters [, ], and < in column names with "_" (preprocessing)
# Data Quality Checking
col = []
missing = []
level = [] 
for name in data.columns:
    
    # Missing
    missper = data[name].isnull().sum() / data.shape[0]
    missing.append(round(missper, 4))

    # Leveling
    lel = data[name].dropna()
    level.append(len(list(set(lel))))

    # Columns
    col.append(name)

summary = pd.concat([pd.DataFrame(col, columns=['name']), 
                     pd.DataFrame(missing, columns=['Missing Percentage']), 
                     pd.DataFrame(level, columns=['Level'])], axis=1)

drop_col = summary['name'][(summary['Level'] <= 1) | (summary['Missing Percentage'] >= 0.8)]
data.drop(columns=drop_col, inplace=True)
print(">>>> Data Shape : {}".format(data.shape))
  • Check how much of each column is missing
  • Leveling: after dropping missing values, count the distinct values; a level of 1 means the column carries no information
  • drop_col: Level <= 1 or Missing Percentage >= 0.8

drop_col
    
  • Only one column, zprior, is dropped
  # X's & Y Split
Y = data['censor']
X = data.drop(columns=['censor'])
idx = list(range(X.shape[0]))
train_idx, valid_idx = train_test_split(idx, test_size=0.3, random_state=2021)
print(">>>> # of Train data : {}".format(len(train_idx)))
print(">>>> # of valid data : {}".format(len(valid_idx)))
print(">>>> # of Train data Y : {}".format(Counter(Y.iloc[train_idx])))
print(">>>> # of valid data Y : {}".format(Counter(Y.iloc[valid_idx])))

[XGBoost Parameters]

  • Package : https://xgboost.readthedocs.io/en/stable/
  • booster : selects the model type run at each iteration (two options)
    • gbtree : tree-based models
    • gblinear : linear models
  • silent : whether to print running messages while training (noisy when experimenting with parameters)
    • 0 prints the messages, 1 suppresses them
  • nthread : how many cores to use for parallel processing
    • by default it grabs every core available
  • learning_rate : the same idea as shrinkage in GBM
  • reg_lambda : L2 regularization term on weights (similar to Ridge regression)
  • reg_alpha : L1 regularization term on weights (similar to Lasso regression) -> it not only penalizes weights but also effectively controls the number of features used!
  • objective [default=reg:linear, renamed reg:squarederror in recent XGBoost versions]
    • This defines the loss function to be minimized. Mostly used values are:
      • binary:logistic – logistic regression for binary classification, returns predicted probability (not class)
      • multi:softmax – multiclass classification using the softmax objective, returns predicted class (not probabilities); you also need to set an additional num_class parameter defining the number of unique classes
      • multi:softprob – same as softmax, but returns the predicted probability of each data point belonging to each class
      • In short, binary:logistic is used to predict probabilities for binary classification problems, while multi:softmax predicts a specific class for multiclass problems
  • eval_metric [ default according to objective ]
    • The metric to be used for validation data (see the sketch after this list).
    • The default values are rmse for regression and error for classification.
    • Typical values are:
      • rmse – root mean square error
      • mae – mean absolute error
      • logloss – negative log-likelihood
      • error – Binary classification error rate (0.5 threshold)
      • merror – Multiclass classification error rate
      • mlogloss – Multiclass logloss
      • auc – Area under the curve
  • Hyperparameter tuning
    • n_estimators, learning_rate, max_depth, reg_alpha
    • XGBoost is one of the algorithms with a very large number of hyperparameters
    • Tuning just these four well is already enough to get good results
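
The grid search below does not use eval_metric, but a minimal sketch of how it pairs with a validation set and early stopping might look like this (xgboost >= 1.6 scikit-learn API with hypothetical settings; older versions pass these arguments to fit() instead):

# Minimal sketch: monitor logloss on the validation set and stop early
model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3,
                      objective='binary:logistic',
                      eval_metric='logloss',       # metric computed on eval_set
                      early_stopping_rounds=10,    # stop if no improvement for 10 rounds
                      random_state=119)
model.fit(X.iloc[train_idx], Y.iloc[train_idx],
          eval_set=[(X.iloc[valid_idx], Y.iloc[valid_idx])],
          verbose=False)
print("Best iteration :", model.best_iteration)
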
# n_estimators
n_tree = [5, 10, 20]
# learning_rate
l_rate = [0.1, 0.3]
# max_depth
m_depth = [3, 5]
# reg_alpha
L1_norm = [0.1, 0.3, 0.5]

# Modeling
save_n = []
save_l = []
save_m = []
save_L1 = []
f1_score_ = []

cnt = 0

for n in n_tree:
    for l in l_rate:
        for m in m_depth:
            for L1 in L1_norm:
                
                print(">>> {} <<<".format(cnt))
                cnt += 1
                print("n_estimators : {}, learning_rate : {}, max_depth : {}, reg_alpha : {}".format(n, l, m, L1))
                model = XGBClassifier(n_estimators=n, learning_rate=l, 
                                      max_depth=m, reg_alpha=L1, objective='binary:logistic', random_state=119)
                model.fit(X.iloc[train_idx], Y.iloc[train_idx])
                
                
                # Train Acc
                y_pre_train = model.predict(X.iloc[train_idx])
                cm_train = confusion_matrix(Y.iloc[train_idx], y_pre_train)
                print("Train Confusion Matrix")
                print(cm_train)
                print("Train Acc : {}".format((cm_train[0,0] + cm_train[1,1])/cm_train.sum()))
                print("Train F1-Score : {}".format(f1_score(Y.iloc[train_idx], y_pre_train)))

                # Test Acc
                y_pre_test = model.predict(X.iloc[valid_idx])
                cm_test = confusion_matrix(Y.iloc[valid_idx], y_pre_test)
                print("Test Confusion Matrix")
                print(cm_test)
                print("TesT Acc : {}".format((cm_test[0,0] + cm_test[1,1])/cm_test.sum()))
                print("Test F1-Score : {}".format(f1_score(Y.iloc[valid_idx], y_pre_test)))
                print("-----------------------------------------------------------------------")
                print("-----------------------------------------------------------------------")
                save_n.append(n)
                save_l.append(l)
                save_m.append(m)
                save_L1.append(L1)
                f1_score_.append(f1_score(Y.iloc[valid_idx], y_pre_test))
                
                # Save model
                #import joblib
                #joblib.dump(model, './XGBoost_model/Result_{}_{}_{}_{}_{}.pkl'.format(n, l, m, L1, round(f1_score_[-1], 4)))
                #gc.collect()
  • Saving the model to the local environment:
    • the file name stores n, l, m, L1, and the latest f1 score (rounded to 4 decimal places)
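
To reuse a stored model later, the matching load call is joblib.load (the file name below is hypothetical, following the pattern above):

import joblib
# Load a previously saved model (hypothetical file name matching the pattern above)
loaded = joblib.load('./XGBoost_model/Result_10_0.1_3_0.1_0.85.pkl')
print(loaded.predict(X.iloc[valid_idx])[:10])
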
  print(">>> {} <<<\nBest Test f1-score : {}\nBest n_estimators : {}\nBest Learning Rate : {}\nBest Max_depth : {}\nBest L1-norm : {}".format(np.argmax(f1_score_),
                                                                                                                                            f1_score_[np.argmax(f1_score_)], 
                                                                                                                                            save_n[np.argmax(f1_score_)],
                                                                                                                                            save_l[np.argmax(f1_score_)],
                                                                                                                                            save_m[np.argmax(f1_score_)],
                                                                                                                                            save_L1[np.argmax(f1_score_)]))
  best_model = XGBClassifier(n_estimators=save_n[np.argmax(f1_score_)], learning_rate=save_l[np.argmax(f1_score_)], 
                           max_depth=save_m[np.argmax(f1_score_)], reg_alpha=save_L1[np.argmax(f1_score_)], objective='binary:logistic', 
                           random_state=119)
best_model.fit(X.iloc[train_idx], Y.iloc[train_idx])

# Train Acc
y_pre_train = best_model.predict(X.iloc[train_idx])
cm_train = confusion_matrix(Y.iloc[train_idx], y_pre_train)
print("Train Confusion Matrix")
print(cm_train)
print("Train Acc : {}".format((cm_train[0,0] + cm_train[1,1])/cm_train.sum()))
print("Train F1-Score : {}".format(f1_score(Y.iloc[train_idx], y_pre_train)))

# Test Acc
y_pre_test = best_model.predict(X.iloc[valid_idx])
cm_test = confusion_matrix(Y.iloc[valid_idx], y_pre_test)
print("Test Confusion Matrix")
print(cm_test)
print("TesT Acc : {}".format((cm_test[0,0] + cm_test[1,1])/cm_test.sum()))
print("Test F1-Score : {}".format(f1_score(Y.iloc[valid_idx], y_pre_test)))
feature_map = pd.DataFrame(sorted(zip(best_model.feature_importances_, X.columns), reverse=True), columns=['Score', 'Feature'])
print(feature_map)
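
feature_importances_ reports a single importance definition; the underlying booster can also score features under other definitions, such as 'weight' (split counts) or 'gain'. A minimal sketch:

# Importance under a different definition, via the underlying booster
gain_score = best_model.get_booster().get_score(importance_type='gain')
print(sorted(gain_score.items(), key=lambda kv: kv[1], reverse=True)[:10])
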
# Importance Score Top 10
feature_map_10 = feature_map.iloc[:10]
plt.figure(figsize=(20, 10))
sns.barplot(x="Score", y="Feature", data=feature_map_10.sort_values(by="Score", ascending=False))
plt.title('XGBoost Importance Features')
plt.tight_layout()
plt.show()

[Figure: bar plot of the top 10 XGBoost feature importances]
