
[Purpose]

  • Hands-on practice with PCA, the most commonly used dimensionality reduction technique
  • PCA finds the axes that maximize the variance of the data (see the from-scratch sketch below)
  • It uses no class labels and is not a learning algorithm; it is an unsupervised transformation
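
  • Conceptually, PCA centers the data and takes the top eigenvectors of its covariance matrix as the new axes. A minimal from-scratch sketch of this idea on toy data (illustration only, not part of the practice code):

    import numpy as np

    # Toy data: 100 samples, 3 features (hypothetical, illustration only)
    rng = np.random.default_rng(0)
    X_toy = rng.normal(size=(100, 3))

    X_centered = X_toy - X_toy.mean(axis=0)  # PCA centers the data first
    cov = np.cov(X_centered, rowvar=False)   # 3 x 3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # sort axes by variance, descending
    components = eigvecs[:, order[:2]]       # top-2 variance-maximizing directions
    X_reduced = X_centered @ components      # project onto the new axes
    print(X_reduced.shape)                   # (100, 2)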

[Process]

  1. Define Data
  2. Modeling
  3. Plotting

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import sklearn.datasets as data

    from keras.datasets import mnist
    from sklearn.decomposition import PCA
  • We will use the MNIST image dataset
    # MNIST Data Loading
    (X_train, Y_train), (X, y) = mnist.load_data()

    del X_train
    del Y_train
  • The training split is too large, so X_train and Y_train are deleted; we use only the 10,000 test images

    # Data Shape 확인
    print(">>>> MNIST Data Shape : {}".format(X.shape))
    print(">>>> MNIST Label Shape : {}".format(y.shape))
    
  • (10000, 28, 28): 10,000 images of 28 × 28 pixels each

    # Flatten (10000, 28, 28) to (10000, 784)
    X = X.reshape(-1, 28*28)
    
  • This flattens each 28 × 28 image into a single 784-dimensional row vector

    # MNIST Visualization
    def show_images(num_images):
        if num_images % 10 == 0 and num_images <= 100:
            for digit_num in range(0, num_images):
                plt.subplot(int(num_images / 10), 10, digit_num + 1)  # create subplots
                mat_data = X[digit_num].reshape(28, 28)  # reshape images
                plt.imshow(mat_data)  # plot the data
                plt.xticks([])  # removes numbered labels on x-axis
                plt.yticks([])  # removes numbered labels on y-axis

    show_images(50)
    

(figure: grid of the first 50 MNIST digits)

  • Shows only the first 50 images

    # Show images of a specific digit
    def show_images_by_digit(digit_to_see):
        if digit_to_see in list(range(10)):
            indices = np.where(y == digit_to_see)  # pull indices for the digit of interest
            for digit_num in range(0, 50):
                plt.subplot(5, 10, digit_num + 1)  # create subplots
                mat_data = X[indices[0][digit_num]].reshape(28, 28)  # reshape images
                plt.imshow(mat_data)  # plot the data
                plt.xticks([])  # removes numbered labels on x-axis
                plt.yticks([])  # removes numbered labels on y-axis

    show_images_by_digit(7)
    

(figure: 50 MNIST images labeled 7)

  • Shows only the images labeled as 7

[PCA Parameters]

  • Package : https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
  • n_components : how many axes (components) to reduce the dimensionality to
  • explained_variance_ : the amount of variance explained by each of the selected components
    • These are the eigenvalues of the data covariance matrix (see the sanity-check sketch after this list)
  • explained_variance_ratio_ : percentage of variance explained by each of the selected components
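
  • A quick sanity check of the eigenvalue claim above, on hypothetical demo data (sketch, not part of the original code): explained_variance_ should match the leading eigenvalues of the sample covariance matrix up to floating-point error.

    # Verify: explained_variance_ equals the top covariance eigenvalues
    rng = np.random.default_rng(0)
    X_demo = rng.normal(size=(200, 5))   # hypothetical demo data

    pca_demo = PCA(n_components=3).fit(X_demo)
    eig = np.linalg.eigvalsh(np.cov(X_demo, rowvar=False))[::-1]
    print(pca_demo.explained_variance_)  # top-3 eigenvalues
    print(eig[:3])                       # same values, up to float error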
    # PCA Fitting
    pca = PCA(n_components=20)
    X_pca = pca.fit_transform(X)
    print("PCA Output shape : {}".format(X_pca.shape))
  • Reduces the 784 dimensions down to 20
  • scikit-learn's PCA centers the data internally, so there is no need to mean-center beforehand; note that it does not scale features to unit variance, which is fine for MNIST since all pixels share the 0-255 range (a scaling sketch follows below)
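
  • If the features did live on different scales, one would standardize before PCA. A sketch using scikit-learn's StandardScaler (illustration only; not needed for MNIST):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Illustration only: zero mean / unit variance per feature, then PCA
    scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=20))
    X_scaled = scaled_pca.fit_transform(X)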

    # Eigenvalues
    print(pca.explained_variance_)
    
  • These are the eigenvalues: the variance captured along each principal component

    # Eigenvalue Ratio
    print(pca.explained_variance_ratio_)
    
  • Shows how much of the total variance each component explains: the top two components together cover about 17%, and the top three about 23%

    sum(pca.explained_variance_ratio_)
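
  • Instead of fixing n_components by hand, scikit-learn also accepts a float in (0, 1) and keeps just enough components to reach that fraction of variance; a small sketch:

    # Keep as many components as needed to explain 95% of the variance
    pca95 = PCA(n_components=0.95)
    X_pca95 = pca95.fit_transform(X)
    print("Components needed for 95% variance : {}".format(pca95.n_components_))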
    
    def scree_plot(pca):
        num_components = len(pca.explained_variance_ratio_)
        ind = np.arange(num_components)
        vals = pca.explained_variance_ratio_

        plt.figure(figsize=(10, 6))
        ax = plt.subplot(111)
        cumvals = np.cumsum(vals)
        ax.bar(ind, vals)      # variance explained by each component
        ax.plot(ind, cumvals)  # cumulative variance explained
        for i in range(num_components):
            ax.annotate("%.1f%%" % (vals[i] * 100), (ind[i] + 0.2, vals[i]),
                        va="bottom",
                        ha="center",
                        fontsize=8)

        ax.xaxis.set_tick_params(width=0)
        ax.yaxis.set_tick_params(width=1, length=6)

        ax.set_xlabel("Principal Component")
        ax.set_ylabel("Variance Explained (%)")
        plt.title('Explained Variance Per Principal Component')

    scree_plot(pca)
    

(figure: scree plot of per-component and cumulative explained variance)

  • Even with all 20 components, only about 65% of the variance is explained (see the reconstruction sketch below)
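
  • To see what roughly 65% variance retention looks like, a sketch (my addition) that maps the 20 components back to pixel space with inverse_transform and compares one digit:

    # Reconstruct images from the 20 components and compare with the original
    X_restored = pca.inverse_transform(X_pca)  # shape (10000, 784)

    plt.subplot(1, 2, 1)
    plt.imshow(X[0].reshape(28, 28))           # original digit
    plt.title("Original")
    plt.subplot(1, 2, 2)
    plt.imshow(X_restored[0].reshape(28, 28))  # 20-component reconstruction
    plt.title("Reconstructed (20 PCs)")
    plt.show()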

    # Redesign
    new_coordinates = np.vstack((X_pca[:,:2].T, y)).T
    dataframe = pd.DataFrame(data=new_coordinates, columns=("1st_principal", "2nd_principal", "label"))
    print(dataframe.head())
    
  • The original 784 features (+1 label) are now represented by just three columns: 1st_principal, 2nd_principal, and label
  • X_pca[:, :2].T takes the first two columns of X_pca and swaps its rows and columns (T = transpose)
  • vstack then stacks y below them vertically
  • Applying .T again swaps rows and columns back, leaving three columns in total (a more direct construction is sketched after this list)
  • label holds the digit class (0-9)
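
  • The same three-column frame can be built more directly from a dict, skipping the transpose steps; an equivalent sketch:

    # Equivalent, more direct construction of the same DataFrame
    dataframe = pd.DataFrame({
        "1st_principal": X_pca[:, 0],
        "2nd_principal": X_pca[:, 1],
        "label": y,
    })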

    # Plotting
    sns.FacetGrid(dataframe, hue="label", height=10).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
    plt.show()
    

(figure: 2-D scatter of the first two principal components, colored by digit label)
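
  • FacetGrid's size argument was renamed height in seaborn 0.9 (used in the code above). On recent seaborn versions the same picture can be drawn with the higher-level scatterplot API; a sketch:

    # Same scatter via seaborn's scatterplot (seaborn >= 0.9)
    plt.figure(figsize=(10, 10))
    sns.scatterplot(data=dataframe, x="1st_principal", y="2nd_principal",
                    hue=dataframe["label"].astype(int), palette="tab10", s=10)
    plt.show()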
