3 5. Modeling tuning & Operation

2024-03-02 4 분 소요

🚩 Chapter 05

목차(Context)

01.Hyper parameter tunning
02.Model explanation
03.Modeling operation

01. Hyper parameter tunning

Parameter vs Hyper-parameter

Parameter : 모델 내부에서 정해지는 변수 ex) 회귀식에서 회귀계수
Hyper-parameter : 최적의 paramter를 도출하기 위해 우리가 직접 세팅하는 값 ex) knn에서 k의 값, Random Forest에서 max_depth의 값 등

대표적 방법론

Grid-Search : 하이퍼 파라미터를 일정한 간격으로 변경하며 최적의 파라미터를 찾아가는 기법, 모든 경우의 수를 탐색
Random Search : 정해진 범위 내에서 Random하게 하이퍼 파라미터를 탐색
BayesianOptimization : 입력값(x)를 Input으로 하는 목적 함수 f(x)를 찾는것, x(hyper-parameter) / f(x)(precision, recall, AUC 등)
순차적으로 하이퍼 파라미터를 업데이트해 가면서 평가를 통해 최적의 하이퍼파라미터 조합을 탐색

# ▶ BayesianOptimization 설치
!pip install bayesian-optimization

└ Random Forest Opt

# ▶ BayesianOptimization
import numpy as np
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score


def model_evaluate(n_estimators, maxDepth):
    clf = RandomForestClassifier(
        n_estimators= int(n_estimators),
        max_depth= int(maxDepth))
    scores = cross_val_score(clf, x_train, y_train, cv=5, scoring='roc_auc')
    return np.mean(scores)
    
    
def bayesOpt(x_train, y_train):
    clfBO = BayesianOptimization(model_evaluate, {'n_estimators':  (100, 200),
                                                  'maxDepth': (2, 4)
                                                 })
    clfBO.maximize(init_points=5, n_iter=10)
    print(clfBO.res)

bayesOpt(x_train, y_train)

n_estimators, maxdepth 2개에 대해 tuning 진행
cross validation 점수 계산
cv=5: 5번 돌리고 평균 내라
init_points: 시작 포인트
n_iter: 시작 포인트 이후로 몇 번 더 돌건지
분홍 값으로 tuning 해주면 됨: max depth:3, n_est: 196

# ▶ RandomForest
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as metrics
from sklearn.metrics import roc_auc_score

rfc = RandomForestClassifier(n_estimators=196, max_depth = 3)
rfc.fit(x_train, y_train)

# ▶ 예측
y_pred_train = rfc.predict(x_train)
y_pred_test = rfc.predict(x_test)

y_pred_train_proba = rfc.predict_proba(x_train)[:, 1] 
y_pred_test_proba = rfc.predict_proba(x_test)[:, 1] 


rfc_re = pd.DataFrame({ 'model' : ['RFC(BO)'],
                      'f1_train' :  metrics.f1_score(y_train,y_pred_train),
                      'f1_test' : metrics.f1_score(y_test,y_pred_test),
                      'AUC_train' : roc_auc_score(y_train, y_pred_train_proba),
                      'AUC_test' : roc_auc_score(y_test, y_pred_test_proba),}
                    )

df_comparison = df_comparison.append(rfc_re)
df_comparison.reset_index(drop=True, inplace = True)

성능도 올라가고 overfitting도 잡힘

df_comparison

└ LightGBM Opt

def lgb_evaluate(n_estimators, maxDepth):
    clf = LGBMClassifier(
        objective = 'binary',
        metric= 'auc',
        learning_rate=0.01,
        n_estimators= int(n_estimators),
        max_depth= int(maxDepth))
    scores = cross_val_score(clf, x_train, y_train, cv=5, scoring='roc_auc')
    return np.mean(scores)
    
def bayesOpt(train_x, train_y):
    lgbBO = BayesianOptimization(lgb_evaluate, {                                               
                                                'n_estimators': (100, 200),
                                                'maxDepth': (2, 4)  
                                               })
    lgbBO.maximize(init_points=5, n_iter=10)
    print(lgbBO.res)

bayesOpt(x_train, y_train)

objective, metric, learing_rate는 지정 해주는 것이 좋음

# ▶ lightGBM
from lightgbm import LGBMClassifier

# ▶ setting the parameters
LGBM = LGBMClassifier(n_estimators=100, max_depth=2)
LGBM.fit(x_train, y_train)


# ▶ 예측
y_pred_train = LGBM.predict(x_train)
y_pred_test = LGBM.predict(x_test)

y_pred_train_proba = LGBM.predict_proba(x_train)[:, 1] 
y_pred_test_proba = LGBM.predict_proba(x_test)[:, 1] 

lgbm_re = pd.DataFrame({ 'model' : ['LGBM(BO)3'],
                      'f1_train' :  metrics.f1_score(y_train,y_pred_train),
                      'f1_test' : metrics.f1_score(y_test,y_pred_test),
                      'AUC_train' : roc_auc_score(y_train, y_pred_train_proba),
                      'AUC_test' : roc_auc_score(y_test, y_pred_test_proba),}
                    )

df_comparison = df_comparison.append(lgbm_re)
df_comparison.reset_index(drop=True, inplace = True)

n_est:163, max depth:2

df_comparison

overfitting이 생각보다는 잘 안 잡힘..

df_comparison.style.background_gradient(cmap='coolwarm', low=1)

▶💯 Best Model & Hyper parameter > LGBMClassifier(n_estimators=100, max_depth=2)

02. Model explanation

Permutation Importance

Feature와 실제 결과값 간의 관계(연결고리)를 끊어내도록 특성들의 값을 랜덤하게 섞은 후 모델 예측치의 오류 증가량을 측정,
만약 하나의 특성을 무작위로(shuffle) 섞었을 때 모델 오류가 증가한다면 모델이 예측할 때 해당 특성에 의존한다는 것을 의미
모델의 성능이 떨어지면, 중요한 Feature
모델의 성능이 그대로이거나, 좋아지면 중요하지 않은 Feature

Shapley Value

특정 Feature가 예측값에 얼만가 기여하는지 파악하기 위해 특정 변수와 관련된 모든 변수 조합들을 입력시켰을 때 나온 결과값과 비교를 하면서 변수의 기여도를 계산하는 방식, 즉 특정 Feature(on/off)가 예측값에 얼마나 영향을 끼쳤는지 탐색

# ▶ 설치
!pip install eli5

└ Permutation Importance

import eli5
from eli5.sklearn import PermutationImportance

# ▶ LR Model
perm = PermutationImportance(LR, random_state=1).fit(x_train_sc, y_train)
eli5.show_weights(perm, feature_names = x_train.columns.tolist())

random state 고정 필수

# ▶ RFC Model
perm = PermutationImportance(rfc, random_state=1).fit(x_train, y_train)
eli5.show_weights(perm, top = 11, feature_names = x_train.columns.tolist())

# ▶ LGBM Model
perm = PermutationImportance(LGBM, random_state=3).fit(x_train, y_train)
eli5.show_weights(perm, top = 11, feature_names = x_train.columns.tolist())

total_amt,cnt가 중요 feature로 판단

└ SHAP

# ▶ shap
!pip install shap

import shap
shap.initjs()
# ▶ LR shape
explainer = shap.LinearExplainer(LR.fit(x_train_sc, y_train), x_train_sc)
shap_values = explainer.shap_values(x_test_sc)
shap.summary_plot(shap_values, x_test_sc, feature_names=x_test.columns, plot_type="bar",class_names=y_test)

LR의 경우 linearexplainer꼭 넣어줘야 함

# ▶ LR shap
shap.summary_plot(shap_values, x_test_sc, feature_names=x_test.columns)

# ▶ LR shap
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0], features = x_test_sc[0], feature_names = x_test.columns, link='logit')

각 객체에 대한 영향도 확인 해 볼 수 있음
link=’logit’은 꼭 넣어줘야 함 주의

import shap
shap.initjs()
# ▶ RFC shap
explainer = shap.TreeExplainer(rfc)
shap_values = explainer.shap_values(x_test)
shap.summary_plot(shap_values, x_test, feature_names=x_test.columns, plot_type="bar")

# ▶ RFC shap
shap.summary_plot(shap_values[1], x_test)

# ▶ RFC shap
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1][0,:], feature_names = x_test.columns, link='logit')

import shap
shap.initjs()
# ▶ LGBM shap
explainer = shap.TreeExplainer(LGBM)
shap_values = explainer.shap_values(x_test)
shap.summary_plot(shap_values, x_test, feature_names=x_test.columns, plot_type="bar")

# ▶ LGBM shap
shap.summary_plot(shap_values[1], x_test)

# ▶ RFC shap
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1][4,:] , x_test.iloc[4,:], link='logit')

import shap
shap.initjs()
# ▶ LGBM shap
explainer = shap.TreeExplainer(LGBM)
shap_values = explainer.shap_values(x_test)
shap.summary_plot(shap_values, x_test, feature_names=x_test.columns, plot_type="bar")

# ▶ LGBM shap
shap.summary_plot(shap_values[1], x_test)

# ▶ RFC shap
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1][4,:] , x_test.iloc[4,:], link='logit')

03. Model operation

최종 모델 활용 운영 준비

Target 고객 수준 결정 : 최종 Selection된 Model 기준으로 재구매 가능성이 높은 Target 고객군 선정
Model Save : Model operation을 위해 모델 및 하이퍼파라미터 저장 후 운영시 Model 및 하이퍼파라미터 load
Mart 생성 : Mart를 생성할 때 사용했던 전처리 및 완료 코드를 저장 후 운영시 불러와 월 별 새로운 Mart 생성
Model Input : 생성된 Mart를 Model에 Input하여 결과를 추출 (※ proba 예측확률)
Target 추출 : 1 단계에서 선정한 threshold 기준으로 Target 고객 추출

# ▶ model LGBM
df_comparison

best model은 LGBM(BO)3

# ▶ model LGBM
LGBM = LGBMClassifier(n_estimators=100, max_depth=2)
LGBM.fit(x_train, y_train)

# ▶ 예측
y_pred_train = LGBM.predict(x_train)
y_pred_test = LGBM.predict(x_test)

y_pred_train_proba = LGBM.predict_proba(x_train)[:, 1]
y_pred_test_proba = LGBM.predict_proba(x_test)[:, 1]

best model 적용

# ▶ train target
import numpy as np

bins=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
lift_base = y_train.value_counts(normalize=True)[1]

# ▶ bin 고정 scoring /  [ : 포함, ) : 포함X
confusion_matrix1 = pd.crosstab(pd.cut(y_pred_train_proba, bins, right=False), y_train , rownames=['Predicted'], colnames=['Actual'], margins=True)

# confusion_matrix1 = pd.crosstab(pd.qcut(y_pred_train_proba, 10), y_train , rownames=['Predicted'], colnames=['Actual'], margins=True)
confusion_matrix1['ratio']=round((pd.DataFrame(confusion_matrix1)[1]/pd.DataFrame(confusion_matrix1)['All']),2)
confusion_matrix1['Lift']=round(confusion_matrix1['ratio']/lift_base,1)
confusion_matrix1

예측 확률을 10등분으로 나눠 가지고 거기에 해당하는 타겟 고객 군들의 지표 확인
재구매율 정확성 확인: lift base -> train data에서의 재구매율 가져옴(38%)
pd.crosstab(pd.qcut(y_pred_train_proba, 10)-> 얘는 알아서 적절히 10등분 해줌 둘 중 하나 골라 씀

# ▶ test target
# bins=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
bins=[0.0, 0.5, 1.0]
lift_base = y_test.value_counts(normalize=True)[1]

confusion_matrix1 = pd.crosstab(pd.cut(y_pred_test_proba, bins, right=False), y_test , rownames=['Predicted'], colnames=['Actual'], margins=True)
confusion_matrix1['ratio']=round((pd.DataFrame(confusion_matrix1)[1]/pd.DataFrame(confusion_matrix1)['All']),2)
confusion_matrix1['Lift']=round(confusion_matrix1['ratio']/lift_base,1)
confusion_matrix1

bin을 저렇게 2등분하면 0.5 이하, 초과 이등분 형태로 보기 좋음

import pickle
# ▶ 모델 저장
saved_model = pickle.dumps(LGBM)

# ▶ 모델 Read
LGBM = pickle.loads(saved_model)

———————–EOD—————————

Twitter Facebook LinkedIn

Lee SeungHyun

3 5. Modeling tuning & Operation

🚩 Chapter 05

01. Hyper parameter tunning

└ Random Forest Opt

└ LightGBM Opt

02. Model explanation

└ Permutation Importance

└ SHAP

03. Model operation

공유하기

댓글남기기

참고

코드 유사성 판단 경진 대회- private 3위

25.One-class SVM Code

24.Robust Random Cut Forest Code

23.Isolation Forest Code