18. PCA Code
[Purpose]
- Hands-on practice with PCA, the most commonly used dimensionality reduction technique
- A technique that finds the axes that maximize variance (see the sketch below)
- Unsupervised: it uses no class labels and is not a learning algorithm in itself
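- As a quick illustration of the variance-maximization idea, a minimal NumPy sketch (toy data and variable names of my own, not from the original notebook): the first principal axis is the top eigenvector of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.1])  # unequal variance per axis

Xc = X_toy - X_toy.mean(axis=0)          # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)          # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order

pc1 = eigvecs[:, -1]                     # top eigenvector = direction of max variance
print("1st principal axis :", pc1)       # ~ [1, 0, 0] up to sign
print("Variance along it  :", eigvals[-1])  # ~ 9 (= 3.0**2)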
[Process]
- Define Data
- Modeling
- Plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.datasets as data
from keras.datasets import mnist
from sklearn.decomposition import PCA
- We will use the MNIST image dataset
# MNIST Data Loading
(X_train, Y_train), (X, y) = mnist.load_data()
del X_train
del Y_train
- The training split is too large, so we delete X_train and Y_train and keep only the 10,000 test images
# Check data shapes
print(">>>> MNIST Data Shape : {}".format(X.shape))
print(">>>> MNIST Label Shape : {}".format(y.shape))
- (10000, 28, 28): 10,000 images of 28 x 28 pixels each
# Flatten (1, 28, 28) to (1, 784)
X = X.reshape(-1, 28*28)
- Flattens each image into a single 784-dimensional row vector
# MNIST Visualization
def show_images(num_images):
    if num_images % 10 == 0 and num_images <= 100:
        for digit_num in range(0, num_images):
            plt.subplot(int(num_images/10), 10, digit_num+1)  # create subplots
            mat_data = X[digit_num].reshape(28, 28)  # reshape images
            plt.imshow(mat_data)  # plot the data
            plt.xticks([])  # removes numbered labels on x-axis
            plt.yticks([])  # removes numbered labels on y-axis

show_images(50)
- Shows only the first 50 images
# Show images of a specific digit
def show_images_by_digit(digit_to_see):
    if digit_to_see in list(range(10)):
        indices = np.where(y == digit_to_see)  # pull indices for digit of interest
        for digit_num in range(0, 50):
            plt.subplot(5, 10, digit_num+1)  # create subplots
            mat_data = X[indices[0][digit_num]].reshape(28, 28)  # reshape images
            plt.imshow(mat_data)  # plot the data
            plt.xticks([])  # removes numbered labels on x-axis
            plt.yticks([])  # removes numbered labels on y-axis

show_images_by_digit(7)
- Shows only the images labeled 7
[PCA Parameters]
- Package : https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- n_components : how many axes (principal components) to reduce the data to
- explained_variance_ : The amount of variance explained by each of the selected components.
- These are the eigenvalues (checked in the sketch after this list)
- explained_variance_ratio_ : Percentage of variance explained by each of the selected components.
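- A quick sketch (illustrative, not part of the original notebook) to verify the eigenvalue claim: explained_variance_ should match the top eigenvalues of the sample covariance matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))  # toy data

pca_check = PCA(n_components=3).fit(A)
cov_eigvals = np.linalg.eigvalsh(np.cov(A, rowvar=False))[::-1]  # descending order

print(pca_check.explained_variance_)  # matches cov_eigvals[:3]
print(cov_eigvals[:3])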
# PCA Fitting
pca = PCA(n_components = 20)
X_pca = pca.fit_transform(X)
print("PCA Output shape : {}".format(X_pca.shape))
- Reduces the 784 dimensions down to 20
- PCA centers the data internally (mean subtraction), so no separate centering step is needed; note it does not standardize to unit variance (the sketch below checks this)
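- A small check of that claim (a sketch with hypothetical variable names, not in the original notebook): fitting on raw X and on pre-centered X yields the same principal axes.
# svd_solver="full" keeps both fits deterministic for the comparison
X_centered = X - X.mean(axis=0)
pca_raw = PCA(n_components=2, svd_solver="full").fit(X)
pca_cent = PCA(n_components=2, svd_solver="full").fit(X_centered)
print(np.allclose(np.abs(pca_raw.components_), np.abs(pca_cent.components_)))  # True

# If unit-variance scaling is wanted, it must still be done beforehand:
# from sklearn.preprocessing import StandardScaler
# X_std = StandardScaler().fit_transform(X)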
# Eigenvalues
print(pca.explained_variance_)
- The eigenvalue of each selected component
# Eigenvalue ratio
print(pca.explained_variance_ratio_)
- How much of the variance each component explains: the top 2 together cover about 17%; adding a 3rd brings it to about 23%
# Total variance explained by all 20 selected components
sum(pca.explained_variance_ratio_)
# Scree plot: per-component bars plus cumulative line
def scree_plot(pca):
    num_components = len(pca.explained_variance_ratio_)
    ind = np.arange(num_components)
    vals = pca.explained_variance_ratio_

    plt.figure(figsize=(10, 6))
    ax = plt.subplot(111)
    cumvals = np.cumsum(vals)
    ax.bar(ind, vals)
    ax.plot(ind, cumvals)
    for i in range(num_components):
        ax.annotate(r"%s%%" % ((str(round(vals[i]*100, 1))[:3])),
                    (ind[i]+0.2, vals[i]), va="bottom", ha="center", fontsize=8)

    ax.xaxis.set_tick_params(width=0)
    ax.yaxis.set_tick_params(width=1, length=6)
    ax.set_xlabel("Principal Component")
    ax.set_ylabel("Variance Explained (%)")
    plt.title('Explained Variance Per Principal Component')

scree_plot(pca)
- Even with all 20 components, only about 65% of the variance is captured
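- Instead of fixing n_components=20 by hand, sklearn also accepts a float, choosing however many components are needed to reach that fraction of the variance; a short sketch (not in the original notebook):
# Let sklearn pick the number of components for a target variance fraction
pca_95 = PCA(n_components=0.95)
X_pca_95 = pca_95.fit_transform(X)
print("Components needed for 95% variance : {}".format(pca_95.n_components_))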
# Redesign
new_coordinates = np.vstack((X_pca[:, :2].T, y)).T
dataframe = pd.DataFrame(data=new_coordinates, columns=("1st_principal", "2nd_principal", "label"))
print(dataframe.head())
- Represents the original 784(+1 for y) dimensions as just 1st_principal, 2nd_principal, and label (y)
- X_pca[:, :2].T takes the first two columns of X_pca and swaps rows and columns (.T = transpose)
- vstack stacks y vertically beneath them
- Applying .T again swaps rows and columns back, leaving three columns in total (an equivalent direct construction is sketched after this list)
- label : the digit classes 0-9
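- The same three-column DataFrame can be built more directly; a sketch of an equivalent construction:
# Equivalent, more direct construction (same columns as the vstack + .T version)
dataframe = pd.DataFrame({
    "1st_principal": X_pca[:, 0],
    "2nd_principal": X_pca[:, 1],
    "label": y,
})
print(dataframe.head())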
# Plotting (height replaces seaborn's deprecated size parameter)
sns.FacetGrid(dataframe, hue="label", height=10).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()
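- To see what keeping ~65% of the variance means visually, a sketch (not in the original notebook) using pca.inverse_transform to reconstruct images from the 20 components:
# Map 20-D codes back to 784-D; top row: originals, bottom row: reconstructions
X_recon = pca.inverse_transform(X_pca)
for i in range(5):
    plt.subplot(2, 5, i + 1)
    plt.imshow(X[i].reshape(28, 28))
    plt.xticks([]); plt.yticks([])
    plt.subplot(2, 5, i + 6)
    plt.imshow(X_recon[i].reshape(28, 28))
    plt.xticks([]); plt.yticks([])
plt.show()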