18. PCA Code
[Purpose]
- Hands-on practice with PCA, the most commonly used dimensionality reduction technique
- A technique that finds the axes that maximize variance (see the sketch below)
- Unsupervised: it uses no class labels and is not a learning algorithm in itself
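- As a quick illustration of the variance-maximization idea, a minimal NumPy sketch (toy data and variable names of my own, not from the original notebook): the first principal axis is the top eigenvector of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.1])  # unequal variance per axis

Xc = X_toy - X_toy.mean(axis=0)          # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)          # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order

pc1 = eigvecs[:, -1]                     # top eigenvector = direction of max variance
print("1st principal axis :", pc1)       # ~ [1, 0, 0] up to sign
print("Variance along it  :", eigvals[-1])  # ~ 9 (= 3.0**2)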
[Process]
- Define Data
- Modeling
- Plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.datasets as data
from keras.datasets import mnist
from sklearn.decomposition import PCA
- We will use the MNIST image dataset
# MNIST Data Loading
(X_train, Y_train), (X, y) = mnist.load_data()
del X_train
del Y_train
- The training split is too large, so we delete X_train and Y_train and keep only the 10,000 test images
# Check data shapes
print(">>>> MNIST Data Shape : {}".format(X.shape))
print(">>>> MNIST Label Shape : {}".format(y.shape))
- (10000, 28, 28): 10,000 images of 28 x 28 pixels each
# Flatten (1, 28, 28) to (1, 784)
X = X.reshape(-1, 28*28)
- Flattens each image into a single 784-dimensional row vector
# MNIST Visualization
def show_images(num_images):
    if num_images % 10 == 0 and num_images <= 100:
        for digit_num in range(0, num_images):
            plt.subplot(int(num_images/10), 10, digit_num+1)  # create subplots
            mat_data = X[digit_num].reshape(28, 28)  # reshape images
            plt.imshow(mat_data)  # plot the data
            plt.xticks([])  # removes numbered labels on x-axis
            plt.yticks([])  # removes numbered labels on y-axis

show_images(50)
- Shows only the first 50 images
# Show images of a specific digit
def show_images_by_digit(digit_to_see):
    if digit_to_see in list(range(10)):
        indices = np.where(y == digit_to_see)  # pull indices for digit of interest
        for digit_num in range(0, 50):
            plt.subplot(5, 10, digit_num+1)  # create subplots
            mat_data = X[indices[0][digit_num]].reshape(28, 28)  # reshape images
            plt.imshow(mat_data)  # plot the data
            plt.xticks([])  # removes numbered labels on x-axis
            plt.yticks([])  # removes numbered labels on y-axis

show_images_by_digit(7)
- Shows only the images labeled 7
[PCA Parameters]
- Package : https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- n_components : how many axes (principal components) to reduce the data to
- explained_variance_ : The amount of variance explained by each of the selected components.
- These are the eigenvalues (checked in the sketch after this list)
- explained_variance_ratio_ : Percentage of variance explained by each of the selected components.
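- A quick sketch (illustrative, not part of the original notebook) to verify the eigenvalue claim: explained_variance_ should match the top eigenvalues of the sample covariance matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))  # toy data

pca_check = PCA(n_components=3).fit(A)
cov_eigvals = np.linalg.eigvalsh(np.cov(A, rowvar=False))[::-1]  # descending order

print(pca_check.explained_variance_)  # matches cov_eigvals[:3]
print(cov_eigvals[:3])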
# PCA Fitting
pca = PCA(n_components = 20)
X_pca = pca.fit_transform(X)
print("PCA Output shape : {}".format(X_pca.shape))
- Reduces the 784 dimensions down to 20
- PCA centers the data internally (mean subtraction), so no separate centering step is needed; note it does not standardize to unit variance (the sketch below checks this)
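- A small check of that claim (a sketch with hypothetical variable names, not in the original notebook): fitting on raw X and on pre-centered X yields the same principal axes.
# svd_solver="full" keeps both fits deterministic for the comparison
X_centered = X - X.mean(axis=0)
pca_raw = PCA(n_components=2, svd_solver="full").fit(X)
pca_cent = PCA(n_components=2, svd_solver="full").fit(X_centered)
print(np.allclose(np.abs(pca_raw.components_), np.abs(pca_cent.components_)))  # True

# If unit-variance scaling is wanted, it must still be done beforehand:
# from sklearn.preprocessing import StandardScaler
# X_std = StandardScaler().fit_transform(X)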
# Eigenvalues
print(pca.explained_variance_)
- The eigenvalue of each selected component
# Eigenvalue ratio
print(pca.explained_variance_ratio_)
- How much of the variance each component explains: the top 2 together cover about 17%; adding a 3rd brings it to about 23%
# Total variance explained by all 20 selected components
sum(pca.explained_variance_ratio_)
# Scree plot: per-component bars plus cumulative line
def scree_plot(pca):
    num_components = len(pca.explained_variance_ratio_)
    ind = np.arange(num_components)
    vals = pca.explained_variance_ratio_

    plt.figure(figsize=(10, 6))
    ax = plt.subplot(111)
    cumvals = np.cumsum(vals)
    ax.bar(ind, vals)
    ax.plot(ind, cumvals)
    for i in range(num_components):
        ax.annotate(r"%s%%" % ((str(round(vals[i]*100, 1))[:3])),
                    (ind[i]+0.2, vals[i]), va="bottom", ha="center", fontsize=8)

    ax.xaxis.set_tick_params(width=0)
    ax.yaxis.set_tick_params(width=1, length=6)
    ax.set_xlabel("Principal Component")
    ax.set_ylabel("Variance Explained (%)")
    plt.title('Explained Variance Per Principal Component')

scree_plot(pca)
- Even with all 20 components, only about 65% of the variance is captured
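- Instead of fixing n_components=20 by hand, sklearn also accepts a float, choosing however many components are needed to reach that fraction of the variance; a short sketch (not in the original notebook):
# Let sklearn pick the number of components for a target variance fraction
pca_95 = PCA(n_components=0.95)
X_pca_95 = pca_95.fit_transform(X)
print("Components needed for 95% variance : {}".format(pca_95.n_components_))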
# Redesign
new_coordinates = np.vstack((X_pca[:, :2].T, y)).T
dataframe = pd.DataFrame(data=new_coordinates, columns=("1st_principal", "2nd_principal", "label"))
print(dataframe.head())
- Represents the original 784(+1 for y) dimensions as just 1st_principal, 2nd_principal, and label (y)
- X_pca[:, :2].T takes the first two columns of X_pca and swaps rows and columns (.T = transpose)
- vstack stacks y vertically beneath them
- Applying .T again swaps rows and columns back, leaving three columns in total (an equivalent direct construction is sketched after this list)
- label : the digit classes 0-9
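- The same three-column DataFrame can be built more directly; a sketch of an equivalent construction:
# Equivalent, more direct construction (same columns as the vstack + .T version)
dataframe = pd.DataFrame({
    "1st_principal": X_pca[:, 0],
    "2nd_principal": X_pca[:, 1],
    "label": y,
})
print(dataframe.head())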
# Plotting (height replaces seaborn's deprecated size parameter)
sns.FacetGrid(dataframe, hue="label", height=10).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()
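- To see what keeping ~65% of the variance means visually, a sketch (not in the original notebook) using pca.inverse_transform to reconstruct images from the 20 components:
# Map 20-D codes back to 784-D; top row: originals, bottom row: reconstructions
X_recon = pca.inverse_transform(X_pca)
for i in range(5):
    plt.subplot(2, 5, i + 1)
    plt.imshow(X[i].reshape(28, 28))
    plt.xticks([]); plt.yticks([])
    plt.subplot(2, 5, i + 6)
    plt.imshow(X_recon[i].reshape(28, 28))
    plt.xticks([]); plt.yticks([])
plt.show()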