3 3. Data Mart & Feature Engineering

2024-03-02 6 분 소요

🚩 Chapter 03

목차(Context)

01.Data Mart 기획 및 설계
02.Data 추출 및 Mart 개발
03.Integer Feature Engineering
04.Categoical Feature Engineering

01. Data Mart 기획 및 설계

모델링을 위한 Data Mart 기획

Sampling 완료 후 월 별 기준이 되는 CustomerID를 추출한 상태
Mart를 구성하기 위한 준비가 모두 완료된 상태
가설을 수립하고, 가설에 해당하는 다양한 변수 Category와 List를 생성
풍부한 Data Mart를 기획해야 좋은 모델의 성능과 다양한 해석이 가능

02. Data 추출 및 Mart 개발

Data Mart 기획서 데이터 추출

상위 단계에서 정의한 Data Mart를 기반으로 변수를 추출하는 과정
변수 추출은 Sampling한 Data로 진행

# ▶ 최초 Data에서 bsym(기준년월)을 추가한 origin data
df_origin.head()

아까 복사 해놨던 원본에 bsym붙인거

└ Mart 개발 준비

# ▶ Sampling한 CustomerID 대상 Mart 개발 수행
df_all_sample.head()

샘플링 한게 맞는지 확인? .shape

# ▶ 원본(Origin) data에서 Sampling 하기 위한 key 생성 (1)
df_origin['sample_key'] = df_origin['bsym'].astype(str) + df_origin['CustomerID'].astype(str)
df_origin.head()

원본에 키 형성

# ▶ 원본(Origin)에서 Sampling 하기 위한 key 생성 (2)
df_all_sample['sample_key'] = df_all_sample['bsym'].astype(str) + df_all_sample['CustomerID'].astype(str)
df_all_sample.head()

샘플링한 데이터에도 키 형성

# ▶ sampling 된 CustomerID만 추출 
df_origin_sample = df_origin[df_origin['sample_key'].isin(df_all_sample['sample_key'])]

# ▶ amt col 생성
df_origin_sample['amt'] = df_origin_sample['UnitPrice'] * df_origin_sample['Quantity']
df_origin_sample.head()

isin: 원본에서 샘플에 있는 키만 가지고 옴
amt: 양 X 단위 가격

df_origin.shape, df_origin_sample.shape

39만->11만

└ Mart 추출 시작

# ▶ 변수 생성 시작 (total_amt)
df_mart = df_origin_sample.groupby(['bsym', 'CustomerID'])['amt'].agg(total_amt = ('sum')).reset_index()
df_mart

total_amt = (‘sum’): sum 해준 뒤, total_amt로 열 만들어줌 유용
reset_index는 써주는게 유용

# ▶ 변수 생성 시작 (max_amt, min_amt)
# pd.DataFrame(df_origin_sample.groupby(['bsym', 'CustomerID', 'InvoiceNo'])['amt'].agg(amt = ('sum')).reset_index()).groupby(['bsym', 'CustomerID'])['amt'].agg(max_amt = ('max')).reset_index()
df_mart['max_amt', 'min_amt'](https://code7ssage.github.io/'max_amt', 'min_amt'//) = pd.DataFrame(df_origin_sample.groupby(['bsym', 'CustomerID', 'InvoiceNo'])['amt'].agg(amt = ('sum')).reset_index()).groupby(['bsym', 'CustomerID'])['amt'].agg(max_amt = ('max'), min_amt = ('min')).reset_index()['max_amt', 'min_amt'](https://code7ssage.github.io/'max_amt', 'min_amt'//)
df_mart

bsym, customer ID, invoiceNo 기준으로 그룹화 -> amt=sum -> reset
sym, customer ID 그룹화 -> max, min -> reset ->max, min만 가져옴(대괄호 2개)

# ▶ 변수 생성 시작 (total_cnt)
df_mart['total_cnt'] = pd.DataFrame(df_origin_sample.groupby(['bsym', 'CustomerID'])['InvoiceNo'].agg(total_cnt = ('nunique'))).reset_index()['total_cnt']
df_mart

nunique: unique한 애들만 카운트

# ▶ 변수 생성 시작 (max_cnt, min_cnt)
df_mart['max_cnt', 'min_cnt'](https://code7ssage.github.io/'max_cnt', 'min_cnt'//) = pd.DataFrame(df_origin_sample.groupby(['bsym', 'CustomerID', 'InvoiceNo'])['StockCode'].agg(stock_cnt = 'nunique').reset_index()).groupby(['bsym', 'CustomerID'])['stock_cnt'].agg(max_cnt = ('max'), min_cnt = ('min')).reset_index()['max_cnt', 'min_cnt'](https://code7ssage.github.io/'max_cnt', 'min_cnt'//)
df_mart

# ▶ 변수 생성 시작 (total_qty, max_qty, min_qty)
df_mart['total_qty', 'max_qty', 'min_qty'](https://code7ssage.github.io/'total_qty', 'max_qty', 'min_qty'//) = df_origin_sample.groupby(['bsym', 'CustomerID'])['Quantity'].agg(total_qty = 'sum', max_qty = 'max', min_qty = 'min').reset_index()['total_qty', 'max_qty', 'min_qty'](https://code7ssage.github.io/'total_qty', 'max_qty', 'min_qty'//)
df_mart

# ▶ 변수 생성 시작 (country)
df_mart['Country'] = df_origin_sample.groupby(['bsym', 'CustomerID'])['Country'].agg(country = 'first').reset_index()['country']
df_mart

# ▶ target join
df_mart = pd.merge(df_mart, df_all_sample['bsym', 'CustomerID', 'target'](https://code7ssage.github.io/'bsym', 'CustomerID', 'target'//), how='left', on=['bsym', 'CustomerID'])
df_mart

target 데이터가 없으니 all sample에서 가져옴, 기준: bsym, customer ID

# ▶ 최종 Data set
df_mart.shape

03. Integer Feature Engineering

Feature Engineering이란?

모델링의 목적인 성능향상을 위해 Feature를 생성, 선택, 가공하는 일련의 모든 활동

Target 변수와의 의미있는 변수 선택하는 방법

Regression(회귀) : 상관계수 분석을 통해 유의미한 변수 선택
Classification(분류) : bin(통)으로 구분 후 Target 변수와의 관계 파악

IV (Information Value)

Feature 하나가 Good(Target)과 Bad(Non-target)을 잘 구분해줄 수 있는지에 대한 정보량을 표현하는 방법론
IV 수치가 클수록 Target과 Non-target을 잘 구분할 수 있는 정보량이 많은 Feature이고, IV 수치가 작을수록 정보량이 적은 Feature
(Target data 구성비 - Non-target data 구성비) * WoE
WoE(Weight of Evidence) = ln(Target data 구성비 / Non-target data 구성비)
ln(자연로그)를 취하는 이유? : WoE의 최대, 최소값의 범위를 맞춰주기 위해 (1~∞ , -∞~1)

개념 참조

└ Regression 문제 (corr)

df_mart.head()

# ▶ heatmap plot
import seaborn as sns
# ▶ 모든 조합, 상관계수 표현

df_pair = df_mart['total_amt', 'max_amt', 'min_amt', 'total_cnt','max_cnt', 'min_cnt', 'total_qty', 'max_qty', 'min_qty'](https://code7ssage.github.io/'total_amt', 'max_amt', 'min_amt', 'total_cnt','max_cnt', 'min_cnt', 'total_qty', 'max_qty', 'min_qty'//)
mask = np.triu(np.ones_like(df_pair.corr(), dtype=np.bool))
sns.heatmap(df_pair.corr(), vmin = -1, vmax = +1, annot = True, cmap = 'coolwarm', mask = mask);
plt.gcf().set_size_inches(10, 10)

mask: heatmap에서 중복되는 대칭인 부분 가려주는거

└ Classification 문제 (IV)

# ▶ Binning
total_amt_quantile  = df_mart['total_amt'].quantile(q=list(np.arange(0.05,1,0.05)), interpolation = 'nearest')
# total_amt_quantile

# ▶ 그룹 초기화 
df_mart['total_amt_grp'] = 0

# ▶ 40%, 75% 기준으로 총 3개 구간화 진행 
for i in  range(len(df_mart['total_amt'])) :
  if df_mart.loc[i, "total_amt"] <= total_amt_quantile.iloc[7] : # 40%
    df_mart.loc[i, 'total_amt_grp'] = 1
  elif df_mart.loc[i, "total_amt"] <= total_amt_quantile.iloc[14] : # 75%
    df_mart.loc[i, 'total_amt_grp'] = 2
  else : df_mart.loc[i, 'total_amt_grp'] = 3

quantile: 5%씩 잘라서 데이터 리스트 만들어줌
interpolation = ‘nearest’: 해당 %에 데이터가 없을 때 가장 근접한 데이터 넣어줌
[0]일때가 5%니까, [7]일 때 40%, [14]일 때 75%

df_mart.groupby('total_amt_grp')['target'].agg(['sum', 'count'])

# ▶ non-target labeling
df_mart['n_target'] = np.where(df_mart['target']==0, 1, 0)

# ▶ target , non-target count
iv = df_mart.groupby('total_amt_grp')['target', 'n_target'](https://code7ssage.github.io/'target', 'n_target'//).sum().reset_index()

# ▶ good_pct / bad_pct 계산
iv['good_pct'] = iv['target'] / iv['target'].sum()
iv['bad_pct'] = iv['n_target'] / iv['n_target'].sum()

# ▶ t_ratio
iv['t_ratio'] = iv['target'] / (iv['target'] + iv['n_target'])

iv

n_target: target이 0이면 1을 넣어주고 아니면 0을 넣어줌
good, bad, t 퍼센트 각각 계산

# ▶ Group별 IV 계산
import math

for i in range(len(iv['total_amt_grp'])) :
  iv.loc[i, 'iv'] = (iv.loc[i,'good_pct'] - iv.loc[i,'bad_pct']) * math.log(iv.loc[i,'good_pct']/iv.loc[i,'bad_pct'])

iv

IV 계산하면 1loop 종료

# for문 활용 자동화
integer_list = ['total_amt', 'max_amt', 'min_amt', 'total_cnt','max_cnt', 'min_cnt', 'total_qty', 'max_qty', 'min_qty']
iv_df = []

for val in integer_list : 
    # ▶ Binning
    quantile  = df_mart[val].quantile(q=list(np.arange(0.05,1,0.05)), interpolation = 'nearest')

    # ▶ 그룹 초기화 
    df_mart['grp'] = 0

    # ▶ 40%, 75% 기준으로 총 3개 구간화 진행 
    for i in  range(len(df_mart[val])) :
      if df_mart.loc[i, val] <= quantile.iloc[7] :
        df_mart.loc[i, 'grp'] = 1
      elif df_mart.loc[i, val] <= quantile.iloc[14] :
        df_mart.loc[i, 'grp'] = 2
      else : df_mart.loc[i, 'grp'] = 3

    # ▶ non target labeling
    df_mart['n_target'] = np.where(df_mart['target']==0, 1, 0)

    # ▶ target , non-target count
    iv = df_mart.groupby('grp')['target', 'n_target'](https://code7ssage.github.io/'target', 'n_target'//).sum().reset_index()

    # ▶ good_pct / bad_pct 계산
    iv['good_pct'] = iv['target'] / iv['target'].sum()
    iv['bad_pct'] = iv['n_target'] / iv['n_target'].sum()

    # ▶ t_ratio
    iv['t_ratio'] = iv['target'] / (iv['target'] + iv['n_target'])

    iv['val'] = val

    for i in range(len(iv['grp'])) :
      iv.loc[i, 'iv'] = (iv.loc[i,'good_pct'] - iv.loc[i,'bad_pct']) * math.log(iv.loc[i,'good_pct']/iv.loc[i,'bad_pct'])
  
    iv_df.append(iv)

iv_df=pd.concat(iv_df)
iv_df

total_amt ~min_qty 까지 1,2,3 bin에 대해 IV값 계산

# ▶ Feature별 전체 IV 값을 sum해서 비교
iv_df.groupby('val')['iv'].sum().sort_values(ascending=False)

feature별 IV값 내림차순으로 정리

04. Categoical Feature Engineering

Feature Engineering이란?

모델링의 목적인 성능향상을 위해 Feature를 생성, 선택, 가공하는 일련의 모든 활동

Target 변수와의 의미있는 변수 선택하는 방법

차원이 너무 많은 경우에는 차원을 Domain 지식을 활용하여 축소할 것

IV (Information Value)

Feature 하나가 Good(Target)과 Bad(Non-target)을 잘 구분해줄 수 있는지에 대한 정보량을 표현하는 방법론
IV 수치가 클수록 Target과 Non-target을 잘 구분할 수 있는 정보량이 많은 Feature이고, IV 수치가 작을수록 정보량이 적은 Feature
(Target data 구성비 - Non-target data 구성비) * WoE
WoE(Weight of Evidence) = ln(Target data 구성비 / Non-target data 구성비)
ln(자연로그)를 취하는 이유? : WoE의 최대, 최소값의 범위를 맞춰주기 위해 (1~∞ , -∞~1)

개념 참조

Catplot

Categorical Plot : 색상(hue)과 행(row)등을 동시에 사용하여 3 개 이상의 카테고리 값에 의한 변화를 그려볼 수 있음
Categorical data를 파악하는데 용이함

df_mart.head()

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use(['dark_background'])

# ▶ target 기준(hue,색상)으로 구분하여 Plot
sns.catplot(x="Country", hue="target", kind="count",palette="pastel", edgecolor=".6",data=df_mart);
plt.xticks(rotation=-20)

plt.gcf().set_size_inches(25, 3)

hue: 파랑: 0, 주황: 1

# ▶ EDA value
df_cat = pd.DataFrame(df_mart['Country'].value_counts())
df_cat['ratio'] = df_cat['Country'] / df_cat['Country'].sum()
df_cat

UK: 90% 차지 -> 나머지는 묶어도 상관 없을 듯?

# ▶ UK(영국) 이외는 기타국가로 처리
df_mart['Country'] = np.where(df_mart['Country'] == 'United Kingdom', 'UK', 'ETC')
df_mart['Country'].value_counts()

np.where == (if, true, false)

# ▶ target , non-target count
iv = df_mart.groupby('Country')['target', 'n_target'](https://code7ssage.github.io/'target', 'n_target'//).sum().reset_index()

# ▶ good_pct /bad_pct 계산
iv['good_pct'] = iv['target'] / iv['target'].sum()
iv['bad_pct'] = iv['n_target'] / iv['n_target'].sum()

# ▶ t_ratio
iv['t_ratio'] = iv['target'] / (iv['target'] + iv['n_target'])

iv['val'] = 'Country'

for i in range(len(iv['Country'])) :
  iv.loc[i, 'iv'] = (iv.loc[i,'good_pct'] - iv.loc[i,'bad_pct']) * math.log(iv.loc[i,'good_pct']/iv.loc[i,'bad_pct'])

iv

# ▶ 많은 정보량을 가지고 있지 않음
iv['iv'].sum()

0.000017.. -> 너무 낮음;

# ▶ 최종 Mart 저장
df_mart = pd.DataFrame(df_mart)
df_mart.to_csv('df_mart.csv', index=False)

마트 저장 필수!

▶ OptimalBinning

!pip install optbinning

# ▶ numeric
# !pip install optbinning
from optbinning import OptimalBinning

integer_list = ['total_amt', 'max_amt', 'min_amt', 'total_cnt','max_cnt', 'min_cnt', 'total_qty', 'max_qty', 'min_qty']
iv_df = []

for i in integer_list :
  variable = i
  x = df_mart[variable].values
  y = df_mart.target
  # max_n_prebins
  optb = OptimalBinning(name=variable, dtype="numerical", solver="cp", max_n_prebins=3)
  optb.fit(x, y)
  # print("split points : ", optb.splits)

  binning_table = optb.binning_table
  v1 = binning_table.build()

  df = pd.DataFrame({'val' : variable,
                     'IV' : [v1.loc['Totals','IV']]})
  iv_df.append(df)

iv_df = pd.concat(iv_df).reset_index(drop=True)
iv_df.sort_values(by=['IV'], ascending = False)


# binning_table.plot(metric="event_rate")

x: 열의 데이터 값, y: target
numerical 일때는 cp가 디폴트
max_n_prebins: 몇개의 bin 만들건지
val, IV가 키 인 dict형태로 데이터 프레임 형성

binning_table.plot(metric="event_rate")

event rate가 타겟률
아까랑 bining 기준이 달라져 IV값은 달라짐 주의

# ▶ Categorical
variable_cat = "Country"
x_cat = df_mart[variable_cat].values
y_cat = df_mart.target

optb = OptimalBinning(name=variable_cat, dtype="categorical", solver="mip",
                      cat_cutoff=0.1)

optb.fit(x_cat, y_cat)
optb.splits

binning_table = optb.binning_table
display(binning_table.build())

binning_table.plot(metric="event_rate")

categorical에서는 mip사용
cat_cutoff=0.1: 10%미만의 것은 자동적으로 etc로 묶어줌

Twitter Facebook LinkedIn

Lee SeungHyun

3 3. Data Mart & Feature Engineering

🚩 Chapter 03

01. Data Mart 기획 및 설계

02. Data 추출 및 Mart 개발

└ Mart 개발 준비

└ Mart 추출 시작

03. Integer Feature Engineering

└ Regression 문제 (corr)

└ Classification 문제 (IV)

04. Categoical Feature Engineering

공유하기

댓글남기기

참고

코드 유사성 판단 경진 대회- private 3위

25.One-class SVM Code

24.Robust Random Cut Forest Code

23.Isolation Forest Code