3 2. Data Readiness Check & Sampling

2024-03-02 8 분 소요

🚩 Chapter 02

목차(Context)

01.Data Info check
02.Data Readiness Check
03.Data Sampling (1)
04.Data Sampling (2)

01. Data Info check

수집된 데이터의 기본 정보들을 확인

Data를 Read하고 데이터의 형태, 타입, 기간 등 전반적인 기본 정보를 파악하는 단계
기본 사항을 확인하며 전처리도 같이 수행함 (1) Data shape(형태) 확인 (2) Data type 확인 (3) Null값 확인 (※ 빈 값의 Data) (4) 중복 데이터 확인 (5) Outlier 확인 (※ 정상적인 범주를 벗어난 Data)

데이터 살펴보기

이커머스 온라인 구매 데이터
데이터 명세 ⬇

InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
송장번호	재고코드	상세설명	수량	송장날짜	개당가격	고객ID	국가

# ▶ Warnings 제거
import warnings
warnings.filterwarnings('ignore')

# ▶ Google drive mount or 폴더 클릭 후 구글드라이브 연결
from google.colab import drive
drive.mount('/content/drive') 

# ▶ 경로 설정 (※ Colab을 활성화시켰다면 보통 Colab Notebooks 폴더가 자동 생성)
import os
os.chdir('/content/drive/MyDrive/Colab Notebooks/00.Fast_campus/04.ML_signature/Part3')

# ▶ 변경된 경로 확인
os.getcwd()

경로 설정은 자기 drive의 파일로

# ▶ pd.set option
import numpy as np
import pandas as pd 
pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows',100)  

# ▶ Data read (encoding 정보 확인 필수)
df = pd.read_csv('Part3_Data.csv', encoding='ISO-8859-1')
# df = pd.read_csv('Part3_Data.csv')
df.head()

전달 받은 encoding정보 기입하는 거 필수

# ▶ Data 형태 확인
# ▶ 541,909 row, 8 col로 구성됨
df.shape

54만개의 row, 8개의 col

# ▶ Data type 확인
df.info()

invoiceDate는 문자열에서 시간 형태로 바꿔야 할 듯
custom ID도 문자열로 바꾸는게 나을 듯

# ▶ Null 값 확인
print(df.isnull().sum())

# ▶ Null value drop 필요
# ▶ CustomerID 기준으로 Null value drop
df.dropna(subset=['CustomerID'], how='all', inplace=True)
df.isnull().sum()

custom ID열의 null값 전부 drop, inplace-> 바로 적용
근데 description에 있던 null값도 사라짐-> 아마 custom ID에 포함 되어있었는 듯

# ▶ Outlier 확인 (※Domain 지식 없이 성급하게 Outlier를 제거하는 행위는 위험함)
# ▶ Quantity는 음수 값이 존재할 수 없음
df.describe()

quantity에 음수 값이 있는게 말이 안됨;;

# ▶ distplot 활용 음수 데이터 분포 확인
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use(['dark_background'])

sns.distplot(df['Quantity']);

# ▶ 음수 데이터 값 확인
# ▶ Domain 지식을 다시 한 번 점검해야함 (※ 반품 및 회수 물량일 수 도 있음)
df[df['Quantity']<0]

항상 제거 전에는 현업의 의견 체크는 해야함

# ▶ Quantity 음수값 제거
df = df[df['Quantity']>0]
df.describe()

df를 quantity 양수인 것으로 재정의

# ▶ 중복 데이터 확인
pd.Series(df.duplicated()).value_counts()

df.duplicated: 중복이면 True

# ▶ 중복 데이터 확인, raw data 
df[df.duplicated(keep=False)].sort_values(by=list(df.columns)).head(10)

sort: 정렬
keep= ‘first’, ‘last’, False
- keep='first' (기본값): 첫 번째 행은 False로, 나머지 두 행은 True로 표시됩니다.
- keep='last': 마지막 행은 False로, 첫 번째와 두 번째 행은 True로 표시됩니다.
- keep=False: 모든 세 행은 True로 표시됩니다.

# ▶ 중복 데이터 삭제 후 중복 재확인
df = df.drop_duplicates(keep='first')
pd.Series(df.duplicated()).value_counts()

첫번째 남기고 drop

# ▶ 최종 Data 형태 확인 (541,909 → 392,732)
df.shape

54만개->39만개

02. Data Readiness Check

데이터 수준 사전 점검

현재 가지고 있는 데이터의 수준으로 문제를 해결할 수 있는지 점검 (1) Target label 생성 (2) Target Ratio 확인 (3) 분석 방향성 결정

① Target label 생성

당월 구매 고객이 다음달 재구매 시 해당 고객을 재구매 고객으로 정의 (※ Target 정의) ex) 2011.01월 구매 고객이 다시 2011.2월에 구매 한 고객은 미래에 재구매 고객임

② Target Ratio 확인

1번 단계에서 생성한 Target Ratio의 수준을 확인하고, 문제해결이 가능한 수준인지 판단

③ 분석 방향성 결정

1,2번 단계를 종합하여 분석 방향성을 결정 ex) Target ratio가 낮다면 전체고객이 아닌, 특정 Segmentaion 그룹으로 진행 ex) +1M 구매 타겟은 +2M/+3M까지 확대하여 Target Ratio를 상승

└ Target label 생성

df.head()

from datetime import datetime

# ▶ 기준년월 추출 (bsym)
df['date'] = df['InvoiceDate'].str.split(' ').str[0]
df['date'] = pd.to_datetime(df['date'])
df['bsym'] = df['date'].apply(lambda x: datetime.strftime(x, '%Y-%m'))

# ▶ CustomerID는 str Type으로 변경
df['CustomerID'] = df['CustomerID'].astype(str)
df.head()

date열 추가: 공백 기준 분리해서 월/일/연 값만 가지고 온 후 date로 바꿔줌
bsym열 추가: 기준년월 추가해줘서 월의 정보들을 따로 index 해줌
C.ID는 object로 바꿈

# ▶ 기준년월 추가한 원본 데이터 copy
df_origin = df.copy()

# ▶ 데이터 적재 기간 확인
df['date'].min(), df['date'].max()

# ▶ 기준년월 기준으로 Unique한 CustomerID 추출
# ▶ 매 월 구매한 Unique한 고객을 파악하기 위함
df = df['bsym', 'CustomerID'](https://code7ssage.github.io/'bsym', 'CustomerID'//).drop_duplicates(keep='first')
df.head()

고객을 labeling하기 위해 고객ID 중복 제거

# ▶ Target label 생성 Sample Code (1-loop)
i = '2011-01'

# ▶ 구하고자 하는 기준월 데이터 추출
df_left = df[df['bsym'] == i]

# ▶ 기준년월 +1m
bsym_1m = (pd.to_datetime(i) + pd.DateOffset(months=1)).strftime('%Y-%m')

# ▶ 기준년월 +1m, CustomerID는 1월의 구매한 고객
df_right = pd.DataFrame(df[df['bsym'] == bsym_1m]['CustomerID'].unique())
df_right['target'] = 'Y'
df_right.columns = ['CustomerID', 'target']

# ▶ 기준년월 Data에 기준년월 +1M 고객을 Join하면 Target Data 생성 완료
df_merge = pd.merge(df_left, df_right, how='left')
df_merge['target'] = df_merge['target'].fillna('N')

2011년 1월 정보를 df에서 갖고 옴-> left
bsym_1m: 1개월을 더해줌
먼저, df[‘bsym’] == bsym_1m를 통해 df에서 ‘bsym’ 열의 값이 bsym_1m과 일치하는 행을 필터링
unique: 해당 행들의 ‘CustomerID’ 열에서 고유한 값들을 추출
이 고유한 값들의 배열로 새로운 DataFrame df_right를 생성 targer에 Y값 넣어줌
merge해주고 빈 값에 N넣어줌

# ▶ Target data 생성 확인
df_merge.head()

df_merge['target'].value_counts()

1월-> 2월: 한 260명 정도가 재구매 했다 확인

# ▶ for문 활용 전체 Data Set 대상 Target label 생성
loop_list = list(df['bsym'].unique())
df_all = []

for i in loop_list :
  df_left = df[df['bsym'] == i]
  bsym_1m = (pd.to_datetime(i) + pd.DateOffset(months=1)).strftime('%Y-%m')
  df_right = pd.DataFrame(df[df['bsym'] == bsym_1m]['CustomerID'].unique())
  df_right['target'] = 1
  df_right.columns = ['CustomerID', 'target']
  df_merge = pd.merge(df_left, df_right, how='left')
  df_merge['target'] = df_merge['target'].fillna(0)

  df_all.append(df_merge)

df_all = pd.concat(df_all)

loop로 돌려서 전체 월에 시행

df.shape, df_all.shape

데이터 변동은 없는지 확인-> 열 1개만 늘었음

df_all.head()

└ Target ratio 확인

# ▶ 기준년월(bsym) 기준 Target ratio 확인
df_target = df_all.groupby('bsym')['target'].agg(['sum', 'count']).reset_index()
df_target.columns = ['bsym', 'Y', 'Total']
df_target['ratio'] = df_target['Y']/df_target['Total']
df_target

bsym기준 그룹으로 묶고 target열 선택, 각 그룹에 대해 합(sum), 개수(count) 계산
reset-> 다시 bsym을 열로 돌려놓음
ratio: Y(sum)/ total(count)

# ▶ 현재 가지고 있는 데이터는 2011-12월이 마지막이기 때문에 다음달(2012-01) Target Data를 생성할 수 없음 (※ 분석 대상에서 제외)
df_target = df_target.iloc[0:12, :]
df_target

12월은 제외

# ▶ 전체 Data Set에 Target Ratio 확인
df_target['Y'].sum()/df_target['Total'].sum()

전체 ratio: 37%정도

# ▶ 12월 데이터 분석 대상 제외
df_all = df_all[df_all['bsym'] != '2011-12']

# ▶ 결과적으로, 우리는 Target Ratio가 약 37%되는 문제를 해결하려고 하는 상황임
df_all['target'].sum()/len(df_all)

03. Data Sampling (1)

Sampling 수행 여부 결정

Sampling을 진행하면 효과적으로 데이터 분석 및 모델링을 진행할 수 있음
- 데이터 Mart 생성시간 단축
- 모델 학습 시간 단축
현재 데이터의 상황에 맞는 Sampling 기법을 선정하고 Sampling을 수행
Time series data (1) up-sampling : 주기를 변경할 때, 결측값 채우기 (2) down-sampling : time window 기준 mean(), max(), min()
Non-time series data (1) Target data가 너무 적다면(Class-Imbalanced) → Over-sampling (2) Data가 너무 많다면 Under-sampling ※ Under-sampling시 Stratified sampling(층화추출) 진행

① Over-sampling

대표적 Over-sampling 기법 SMOTE(Synthetic Minority Over-sampling Technique)
낮은 비율로 존재하는 클래스의 데이터를 최근접 이웃 KNN 알고리즘을 활용하여 새롭게 생성하는 방법

② Under-sampling & Stratified sampling (층화추출)

Stratified sampling(층화추출) : 모집단을 동질적인 소집단들로 층화시키고 그 집단의 크기에 따라 단순무작위표본추출방법
- 비례층화추출 : 특정 층의 모집단의 비율과 비례해서 표본을 선정하는 방법
장점
- 단순임의 추출보다 신뢰성이 높은 추정치를 구할 수 있음
- 비용 절감 및 자료분석이 용이함
- 각 층에 대한 추정치를 부수적으로 구할 수 있음
단점
- 방대한 모집단의 경우 목록작성의 어려움
- 적절하게 층을 나누기 위해서는 모집단 각 층에 대한 정확한 정보를 필요

03. Data Sampling (2)

└ Time series : Up-sampling

# ▶ Time series Data 생성
time_index = pd.date_range('2023-01-01', periods=3, freq='10S')
df_time = pd.DataFrame({'value' : [1, 2, 3]}, index = time_index)
df_time

# ▶ Time series 실습 Up-sampling(5S)
df_upsample = df_time.resample('5S').mean()
df_upsample

10초 간격 이였던 데이터를 5초 간격으로 늘림-> 공백 발생

# ▶ ffill
df_upsample.fillna(method='ffill')

# ▶ bfill
df_upsample.fillna(method='bfill')

# ▶ fill(0)
df_upsample.fillna(0)

# ▶ mean()
df_upsample.fillna(df_upsample.mean())

# ▶ linear interpolation
df_upsample.interpolate()

1. ffill: 위의 행의 값 넣어줌
1. bfill: 아래의 행의 값 넣어줌
1. fill(0): 0 넣어줌
1. mean(): 평균
1. interpolate: 위 아래 행의 사이값

└ Time series : Down-sampling

# ▶ Down-sampling 실습
df_upsample.resample('10S').mean()

주기를 5->10초로 늘림, 빈 값은 똑같이 처리

└ Non-ts : SMOTE (Over-sampling)

# ▶ SMOTE 실습
df_all.head()

# ▶ 전체 Data Set에 Target Ratio 확인
df_all['target'].sum()/len(df_all)

df_all['target'].value_counts()

# ▶ Sampling을 하기 위한 X, Y Data split
X = df_all.drop(['bsym', 'target'], axis=1)
Y = df_all['target']

customer ID만 사용
X: C.ID, Y: target

# ▶ SMOTE 활용 Over-sampling
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_over, y_over = smote.fit_resample(X, Y)
print("SMOTE 적용 전 학습용 피처/레이블 데이터 세트 : ", X.shape, Y.shape)
print('SMOTE 적용 후 학습용 피처/레이블 데이터 세트 :', X_over.shape, y_over.shape)
print('SMOTE 적용 후 값의 분포 :\n',pd.Series(y_over).value_counts() )

정답(1)이 4618->7822로 증가

# ▶ target과 non-target data가 5:5
pd.Series(y_over).value_counts(normalize=True) 

비율이 1:1로 같아짐
normalize=True: 비율로 바꿔줌

└ Non-ts : Stratified sampling

# ▶ Stratified sampling  실습
df_all.head()

df_target

# ▶ 층화추출을 위해 각 월 별 차지하는 비중을 계산 
# ▶ target_ratio / total_ratio col 생성
df_target['total_ratio'] = df_target['Total'] / df_target['Total'].sum()
df_target.columns = ['bsym', 'y', 'total_cnt', 'target_ratio', 'total_ratio']
df_target

total ratio: 전체에서 월이 차지하는 비중

# ▶ Total(sum)행 추가
df_target = df_target.append({'bsym': 'total'
          , 'y': df_target['y'].sum()
          , 'total_cnt': df_target['total_cnt'].sum()
          , 'target_ratio' : df_target['y'].sum() / df_target['total_cnt'].sum()
          , 'total_ratio' : 1}, ignore_index=True)
df_target

마지막 행에 전체 sum해주는 거 추가, total ratio는 1
target ratio는 평균으로

# ▶ 전체 Data set에서 30% 비례 층화 추출
len(df_all) * 0.3

3732개의 데이터만 가지고 쓸꺼

# ▶ 30% sampling을 한다고 가정했을 때, 각 월 별 비중과 동일하게 Random sampling 하는 것이 중요 (비례층화추출)
# ▶ sample_n : 비례층화추출을 위한 월 별 추출 개수
df_target['sample_n_total'] = len(df_all) * 0.3
df_target['sample_n'] = df_target['sample_n_total'] * df_target['total_ratio']
df_target['sample_n'] = df_target['sample_n'].astype(int)
df_target

각 월별로 30%씩 추출할꺼

# ▶ 월 별 비중을 유지하면서 Random sampling 진행
rnd_sate = 1234

df_all_00 = df_all[df_all['bsym'] == '2010-12'].sample(n=265, random_state = rnd_sate)
df_all_01 = df_all[df_all['bsym'] == '2011-01'].sample(n=222, random_state = rnd_sate)
df_all_02 = df_all[df_all['bsym'] == '2011-02'].sample(n=227, random_state = rnd_sate)
df_all_03 = df_all[df_all['bsym'] == '2011-03'].sample(n=292, random_state = rnd_sate)
df_all_04 = df_all[df_all['bsym'] == '2011-04'].sample(n=256, random_state = rnd_sate)
df_all_05 = df_all[df_all['bsym'] == '2011-05'].sample(n=316, random_state = rnd_sate)
df_all_06 = df_all[df_all['bsym'] == '2011-06'].sample(n=297, random_state = rnd_sate)
df_all_07 = df_all[df_all['bsym'] == '2011-07'].sample(n=284, random_state = rnd_sate)
df_all_08 = df_all[df_all['bsym'] == '2011-08'].sample(n=280, random_state = rnd_sate)
df_all_09 = df_all[df_all['bsym'] == '2011-09'].sample(n=379, random_state = rnd_sate)
df_all_10 = df_all[df_all['bsym'] == '2011-10'].sample(n=409, random_state = rnd_sate)
df_all_11 = df_all[df_all['bsym'] == '2011-11'].sample(n=499, random_state = rnd_sate)

df_all_sample = pd.concat([df_all_00, df_all_01, df_all_02, df_all_03, df_all_04, df_all_05, df_all_06, df_all_07, df_all_08, df_all_09, df_all_10, df_all_11], axis=0)
df_all_sample

random state고정, 각 월별로 샘플 수가 다르니 각각 입력
마지막에 concat해주면 모아서 볼 수 있음

# ▶ 층화추출이 정상적으로 되었는지, 모수의 target ratio와 비교
df_all_sample_temp = df_all_sample.groupby(['bsym'])['target'].agg(['sum', 'count']).reset_index()
df_all_sample_temp['target_ratio'] = df_all_sample_temp['sum'] / df_all_sample_temp['count']
df_all_sample_temp['target_ratio_mosu'] = df_target['target_ratio']

df_all_sample_temp.style.bar(subset=['target_ratio','target_ratio_mosu'], align='zero', vmin=0, vmax=0.8)

df sample에서 target ratio 계산해서 모수의 ratio와 비교
bar 형태로 시각화 해서 엑셀처럼 표현도 가능

# ▶ sample 전체 Target ratio (※ mosu 37%)
df_all_sample_temp['sum'].sum() / df_all_sample_temp['count'].sum()

샘플의 target ratio는 38%로 ㄱㅊ은듯
근데 더 차이를 줄이고 싶으면 random state를 바꿔가며 조절 가능

df_all_sample.shape 

Twitter Facebook LinkedIn

Lee SeungHyun

3 2. Data Readiness Check & Sampling

🚩 Chapter 02

01. Data Info check

02. Data Readiness Check

└ Target label 생성

└ Target ratio 확인

03. Data Sampling (1)

03. Data Sampling (2)

└ Time series : Up-sampling

└ Time series : Down-sampling

└ Non-ts : SMOTE (Over-sampling)

└ Non-ts : Stratified sampling

공유하기

댓글남기기

참고

코드 유사성 판단 경진 대회- private 3위

25.One-class SVM Code

24.Robust Random Cut Forest Code

23.Isolation Forest Code