
[Purpose]

  • Hands-on practice with t-SNE code, a dimensionality-reduction method well suited to high-dimensional data
  • Reveal non-linear structure (a toy sketch follows this list)
    • Well suited to visualizing data such as text and images
  • t-SNE does not predict class labels and is not a learning algorithm
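
  • As a quick illustration of "reveal non-linear structure", the following minimal sketch (not part of the original walkthrough) embeds scikit-learn's make_s_curve toy data into 2-D with t-SNE; the S-curve is a 3-D non-linear manifold that a linear method such as PCA cannot unfold.

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_s_curve
    from sklearn.manifold import TSNE

    # 3-D S-curve: 1,000 points lying on a non-linear 2-D manifold
    X_toy, color = make_s_curve(n_samples=1000, random_state=0)

    # Embed into 2-D; all other t-SNE parameters are left at their defaults
    X_toy_2d = TSNE(n_components=2, random_state=0).fit_transform(X_toy)

    plt.scatter(X_toy_2d[:, 0], X_toy_2d[:, 1], c=color, s=5)
    plt.title("t-SNE embedding of the S-curve")
    plt.show()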

[Process]

  1. Define Data
  2. Modeling
  3. Plotting

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.datasets as data

from keras.datasets import mnist
from sklearn.manifold import TSNE

# MNIST data loading: keep only the 10,000-image test split (the 60,000-image train split is discarded)
(X_train, Y_train), (X, y) = mnist.load_data()

del X_train
del Y_train

# Flatten each 28x28 image into a 784-dimensional vector: (10000, 28, 28) -> (10000, 784)
X = X.reshape(-1, 28*28)
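
  • As a quick sanity check (not in the original walkthrough), the shapes can be verified; the Keras MNIST test split contains 10,000 images.

print(X.shape)  # expected: (10000, 784)
print(y.shape)  # expected: (10000,)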

# MNIST visualization: show the first `num_images` digits in a 10-column grid
def show_images(num_images):
    if num_images % 10 == 0 and num_images <= 100:
        for digit_num in range(0, num_images):
            plt.subplot(int(num_images/10), 10, digit_num+1)  # create subplots
            mat_data = X[digit_num].reshape(28, 28)  # reshape the flat vector back into a 28x28 image
            plt.imshow(mat_data)  # plot the data
            plt.xticks([])  # remove numbered labels on the x-axis
            plt.yticks([])  # remove numbered labels on the y-axis
        plt.show()

show_images(50)

[Figure: the first 50 MNIST digits]

# Show 50 examples of a specific digit
def show_images_by_digit(digit_to_see):
    if digit_to_see in list(range(10)):
        indices = np.where(y == digit_to_see)  # pull indices for the digit of interest
        for digit_num in range(0, 50):
            plt.subplot(5, 10, digit_num+1)  # create subplots
            mat_data = X[indices[0][digit_num]].reshape(28, 28)  # reshape the flat vector back into a 28x28 image
            plt.imshow(mat_data)  # plot the data
            plt.xticks([])  # remove numbered labels on the x-axis
            plt.yticks([])  # remove numbered labels on the y-axis
        plt.show()

show_images_by_digit(1)

[Figure: 50 examples of the digit 1]

[T-SNE Parameters]

  • Package : https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html (a configuration sketch using these parameters follows this list)
  • n_components : how many axes (dimensions) to reduce the data to
  • perplexity : how many neighboring points j to take into account around each point i
    • The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms
    • Consider selecting a value between 5 and 50
    • default = 30
  • learning_rate : t-SNE minimizes the KL divergence with gradient descent
    • This is the learning rate of that gradient descent
    • The learning rate for t-SNE is usually in the range [10.0, 1000.0]
    • default = 'auto'
  • n_iter : number of gradient-descent iterations
    • Maximum number of iterations for the optimization
    • Should be at least 250
    • default = 1000
  • init : the starting point for the low-dimensional embedding
    • {'random', 'pca'}
    • Sets the initial low-dimensional configuration before the KL divergence between the high- and low-dimensional distributions is optimized
    • Initialization of embedding. PCA initialization cannot be used with precomputed distances and is usually more globally stable than random initialization.
    • default = 'pca'
  • method : the algorithm used to compute the gradient
    • {'barnes_hut', 'exact'}
    • Barnes-Hut approximation : O(N log N) time
    • exact : O(N^2) time
    • exact takes much longer
      • Empirically, exact gives roughly 3% lower error
      • exact is a good choice when the dataset is small
    • default = 'barnes_hut'
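
  • To make the parameters concrete, the following is a minimal configuration sketch (not from the original post); the values are illustrative, applied to a small subset of X so it runs quickly.

    from sklearn.manifold import TSNE

    # Assumption: 2,000 images are enough to see rough cluster structure while keeping runtime low
    X_small, y_small = X[:2000], y[:2000]

    tsne = TSNE(
        n_components=2,         # reduce to 2 axes for plotting
        perplexity=30,          # roughly how many neighbors each point considers
        learning_rate='auto',   # step size of the KL-divergence gradient descent ('auto' needs a recent scikit-learn)
        n_iter=1000,            # maximum number of optimization iterations (at least 250)
        init='pca',             # usually more globally stable than 'random'
        method='barnes_hut',    # O(N log N) gradient approximation
        random_state=42,
    )
    X_small_2d = tsne.fit_transform(X_small)
    print(X_small_2d.shape)     # (2000, 2)

  • On the full 10,000-image X, the same call is much slower, as the timing comparison below shows.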

    %%time
    # 2-D t-SNE embedding using method='exact'
    X_tsne_e = TSNE(n_components=2, method='exact').fit_transform(X)
    
  • The exact run took about 1 hour 40 minutes.

    %%time
    # 2-D t-SNE embedding using method='barnes_hut'
    X_tsne_b = TSNE(n_components = 2, method='barnes_hut').fit_transform(X)
    
  • The barnes_hut run took about 3 minutes 2 seconds.
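
  • %%time is a Jupyter magic; as a small aside (not in the original post), the same measurement can be made in a plain Python script with time.perf_counter.

    import time

    start = time.perf_counter()
    X_tsne_b = TSNE(n_components=2, method='barnes_hut').fit_transform(X)
    print(f"barnes_hut fit_transform took {time.perf_counter() - start:.1f} s")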

    # Combine the exact embedding (computed above in ~1 h 40 min) with the labels into a DataFrame
    new_coordinates_e = np.vstack((X_tsne_e[:, :2].T, y)).T
    dataframe_e = pd.DataFrame(data=new_coordinates_e, columns=("Axis 1", "Axis 2", "label"))
    dataframe_e["label"] = dataframe_e["label"].astype(int)  # vstack promotes the labels to float; cast back to int
    # Plotting (recent seaborn versions use `height` instead of the removed `size` argument)
    sns.FacetGrid(dataframe_e, hue="label", height=10).map(plt.scatter, "Axis 1", "Axis 2").add_legend()
    plt.show()
    

    [Figure: 2-D t-SNE embedding of MNIST with method='exact', colored by digit label]

    # Combine the barnes_hut embedding with the labels into a DataFrame
    new_coordinates_b = np.vstack((X_tsne_b[:, :2].T, y)).T
    dataframe_b = pd.DataFrame(data=new_coordinates_b, columns=("Axis 1", "Axis 2", "label"))
    dataframe_b["label"] = dataframe_b["label"].astype(int)  # cast the labels back to int
    # Plotting
    sns.FacetGrid(dataframe_b, hue="label", height=10).map(plt.scatter, "Axis 1", "Axis 2").add_legend()
    plt.show()
    

    [Figure: 2-D t-SNE embedding of MNIST with method='barnes_hut', colored by digit label]

  • If anything, barnes_hut seems to separate the classes better (a quantitative check is sketched below).
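
  • As a hedged follow-up (not part of the original post), scikit-learn's trustworthiness score can put a number on how well each embedding preserves local neighborhoods, instead of judging the plots by eye.

    from sklearn.manifold import trustworthiness

    # Trustworthiness is in [0, 1]; higher means local neighborhoods from the 784-D space are better preserved
    t_exact = trustworthiness(X, X_tsne_e, n_neighbors=5)
    t_bh = trustworthiness(X, X_tsne_b, n_neighbors=5)
    print(f"trustworthiness, method='exact'      : {t_exact:.3f}")
    print(f"trustworthiness, method='barnes_hut' : {t_bh:.3f}")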
