降维方法——t_SNE原理及python代码详解

t_SNE原理和公式，有python代码，把手写数字数据集降为二维

易羽潼

3939人浏览 · 2023-09-05 16:08:13

易羽潼 · 2023-09-05 16:08:13 发布

t_SNE是在SNE的基础上提出来的，是一种将高维数据降到2或者3维的方法，首先介绍SNE的公式，t_SNE只是在其基础上进行了一些修改

1.Stochastic Neighbor Embedding（SNE）

1.1 实现方法

SNE的思想是首先将高维数据的欧氏距离转换为条件概率分布，以此来衡量两个高维点之间的相似度（similarities），高维点 $x_{j}$ 对于 $x_{i}$ 的相似度用条件概率 $p(j|i)$ 来表示，这个概率与它们在以 $x_{i}$ 为中心的高斯分布下的概率密度成比例，对每一个数据点i都有一个这样的分布：

对于邻近的数据点， $p(j|i)$ 相对较高，而对于相隔较远的数据点， $p(j|i)$ 非常小。其中 $\sigma_{i}$ 是以数据点 $x_{i}$ 为中心的高斯分布的方差，确定 $\sigma_{i}$ 值的方法将在后面介绍。

假设转换后的低维点为 $y_{1},y_{2},$ …… $,y_{n}$ ，则同样可以定义其相似度，我们将用于计算条件概率 $p(j|i)$ 的高斯方差设为 $\tfrac{1}{\sqrt{2}}$ ，则条件概率为：

如果映射点 $y_{i}$ 和 $y_{j}$ 正确地模拟了高维数据点 $x_{i}$ 和 $x_{j}$ 之间的相似性，则条件概率p(j|i)和q(j|i)将相等。因此SNE旨在找到低维的数据表示，以最小化p(j|i)和q(j|i)之间的差距，也就是希望这两个概率分布足够的接近。p(j|i)和q(j|i)两者的接近程度使用KL散度来度量，SNE使用梯度下降法最小化所有数据点上的KL散度之和，因此代价函数C为：

但是这个代价函数对于p和q并不是对称的，也就是说当使用距离较远的低维特征点来表示距离较近的高维点时，产生的损失非常大（此时q比较小，p比较大），但是使用距离较近的低维特征点来表示距离较远的高维点时，产生的损失相对较小。

In particular, there is a large cost for using widely separated map points to represent nearby datapoints(i.e., for using a small q(j|i) to model a large p(j|i),but there is only a small cost for using nearby map points to represent widely separated datapoints.[1]

有了代价函数就可以使用大家非常熟悉的梯度下降法来求得对应y，使损失函数尽量小，代价函数对于转换后的低维点 $y_{i}$ 的梯度如下

1.2 二分法求方差

对于数据集中的所有数据点来说，不可能存在单个σi值是最优的，因为数据的密度很可能会变化。在密集地区，较小的σi值通常比在稀疏地区更合适。SNE对σi的值进行二分搜索，生成用户指定的具有固定困惑度(perplexity)的 $P_{i}$ ，在降维的过程中，困惑度是首先被指定好的，困惑被定义为

这种困惑可以解释为邻居有效数量的平滑测量，SNE的性能对困惑度的变化相当稳健，典型值在5到50之间。

The perplexity can be interpreted as asmooth measure of the effectivenumber of neighbors. The performance of SNE isfairly robust to changes in the perplexity,and typical values are between 5 and 50.

2.t-Distributed Stochastic Neighbor Embedding(t_SNE)

2.1 Symmetric SNE

作为最小化条件概率之间的kullbackleibler散度之和的另一种方法，也可以最小化高维空间中的联合概率分布P和低维空间中的联合概率分布Q之间的单个KL散度，因此引入一种对称的SNE，称为Symmetric SNE，因为 $p_{ij}=p_{ji}, q_{ij}=q_{ji}$ ，代价函数为：

其中

可以看到就是把SNE中的条件概率变成了一种对称的形式。

2.2 t_SNE

为了解决SNE仍然存在的拥挤问题(Crowding Problem)，t_SNE在symmetric SNE的基础上将 $q_{ij}$ 修改了一下：

为什么要这样修改呢？我们希望让稍有间隔的点在低维空间中拉得很开。

以下图为例，横轴是两个数据点的欧氏距离。原始的 SNE 选用的是蓝线，而在新的相似度量下，如果我们想要维持同样的概率，需要发生这样的变化：距离接近的点变得更近；距离较远的点变得更远。这样一来，SNE 的拥挤问题就得到了解决。

此时代价函数对应的梯度为

3.Python代码实现

# coding utf-8
'''
代码参考了作者Laurens van der Maaten的开放出的t-sne代码, 并没有用类进行实现,主要是优化了计算的实现
'''
import numpy as np
from sklearn.datasets import load_digits


def cal_pairwise_dist(x):
    '''计算pairwise 距离, x是matrix
    (a-b)^2 = a^2 + b^2 - 2*a*b
    '''
    sum_x = np.sum(np.square(x), 1)
    dist = np.add(np.add(-2 * np.dot(x, x.T), sum_x).T, sum_x)
    return dist


def cal_perplexity(dist, idx=0, beta=0.01):
    '''计算perplexity, dist是距离向量，
    idx指dist中自己与自己距离的位置，beta是高斯分布参数
    这里的perp仅计算了熵，方便计算
    '''
    prob = np.exp(-dist * beta)
    # 设置自身prob为0
    prob[idx] = 0
    sum_prob = np.sum(prob)
    if sum_prob == 0:
        sum_prob= 1e-10
    perp = np.log(sum_prob) + beta * np.sum(dist * prob) / sum_prob
    prob /= sum_prob
    return perp, prob


def seach_prob(x, tol=1e-5, perplexity=30.0):
    '''二分搜索寻找beta,并计算pairwise的prob
    '''

    # 初始化参数
    print("Computing pairwise distances...")
    (n, d) = x.shape
    dist = cal_pairwise_dist(x)
    pair_prob = np.zeros((n, n))
    beta = np.ones((n, 1))/100
    # 取log，方便后续计算
    base_perp = np.log(perplexity)

    for i in range(n):
        if i % 500 == 0:
            print("Computing pair_prob for point %s of %s ..." % (i, n))

        betamin = -np.inf
        betamax = np.inf
        perp, this_prob = cal_perplexity(dist[i], i, beta[i])

        # 二分搜索,寻找最佳sigma下的prob
        perp_diff = perp - base_perp
        tries = 0
        while np.abs(perp_diff) > tol and tries < 50:
            if perp_diff > 0:
                betamin = beta[i].copy()
                if betamax == np.inf or betamax == -np.inf:
                    beta[i] = beta[i] * 2
                else:
                    beta[i] = (beta[i] + betamax) / 2
            else:
                betamax = beta[i].copy()
                if betamin == np.inf or betamin == -np.inf:
                    beta[i] = beta[i] / 2
                else:
                    beta[i] = (beta[i] + betamin) / 2

            # 更新perb,prob值
            perp, this_prob = cal_perplexity(dist[i], i, beta[i])
            perp_diff = perp - base_perp
            tries = tries + 1
        # 记录prob值
        pair_prob[i,] = this_prob
    print("Mean value of sigma: ", np.mean(np.sqrt(1 / beta)))
    return pair_prob


def pca(x, no_dims=50):
    ''' PCA算法
    使用PCA先进行预降维
    '''
    print("Preprocessing the data using PCA...")
    (n, d) = x.shape
    x = x - np.tile(np.mean(x, 0), (n, 1))
    l, M = np.linalg.eig(np.dot(x.T, x))
    y = np.dot(x, M[:, 0:no_dims])
    return y


def tsne(x, no_dims=2, initial_dims=50, perplexity=30.0, max_iter=2000):
    """Runs t-SNE on the dataset in the NxD array x
    to reduce its dimensionality to no_dims dimensions.
    The syntaxis of the function is Y = tsne.tsne(x, no_dims, perplexity),
    where x is an NxD NumPy array.
    """

    # Check inputs
    if isinstance(no_dims, float):
        print("Error: array x should have type float.")
        return -1
    if round(no_dims) != no_dims:
        print("Error: number of dimensions should be an integer.")
        return -1

    # 初始化参数和变量
    x = pca(x, initial_dims).real
    (n, d) = x.shape
    initial_momentum = 0.5
    final_momentum = 0.8
    eta = 300 # 调整更新的参数，类似learning_rate
    min_gain = 0.01
    y = np.random.randn(n, no_dims)
    dy = np.zeros((n, no_dims))
    iy = np.zeros((n, no_dims))
    gains = np.ones((n, no_dims))

    # 对称化
    P = seach_prob(x, 1e-5, perplexity) # n*n的条件概率矩阵
    P = P + np.transpose(P) # 对称
    P = P / np.sum(P)
    # early exaggeration
    P = P * 4
    P = np.maximum(P, 1e-12)

    # Run iterations
    for iter in range(max_iter):
        # Compute pairwise affinities
        sum_y = np.sum(np.square(y), 1)
        '''
        下面这一步是计算低维点两两的距离，代码写得比较抽象，解释一下：
        y = [a b      y.T = [a c
             c d]            b d]
        y 中是两个二维数据，他们的欧氏距离的平方：(a-c)^2+(b-d)^2 = a^2+c^2+b^2+d^2-2ac-2bd
        np.add(-2 * np.dot(y, y.T), sum_y) = [a^2+b^2 ac+bd       + [a^2+b^2 = [-a^2-*b^2          -2ac-2bd+a^2+b^2
                                              ac+bd   c^2+d^2]*-2    c^2+d^2]   -2ac-2bd+c^2+d^2   -*c^2-d^2]
        np.add(np.add(-2 * np.dot(y, y.T), sum_y).T, sum_y) = [0     -2ac-2bd+c^2+d^2+a^2+b^2
                                                               -2ac-2bd+a^2+b^2+c^2+d^2   0]
        '''
        num = 1 / (1 + np.add(np.add(-2 * np.dot(y, y.T), sum_y).T, sum_y)) # q_ij的分子，两两低维点的距离，公式(4)
        num[range(n), range(n)] = 0 # q_ii=0
        Q = num / np.sum(num) # 低维数据的条件概率矩阵Q
        Q = np.maximum(Q, 1e-12)

        # Compute gradient
        PQ = P - Q
        for i in range(n):
            pqyij = PQ[:, i] * num[:, i]
            dy[i, :] = np.sum(np.tile(pqyij, (no_dims, 1)).T * (y[i, :] - y), 0) # 计算梯度，公式(5)

        # Perform the update
        if iter < 20:
            momentum = initial_momentum
        else:
            momentum = final_momentum
        gains = (gains + 0.2) * ((dy > 0) != (iy > 0)) + (gains * 0.8) * ((dy > 0) == (iy > 0)) # 动态变化的学习率？
        gains[gains < min_gain] = min_gain
        iy = momentum * iy - eta * (gains * dy) # 更新量
        y = y + iy
        y = y - np.tile(np.mean(y, 0), (n, 1))
        # Compute current value of cost function
        if (iter + 1) % 100 == 0:
            if iter > 100:
                C = np.sum(P * np.log(P / Q))
            else:
                C = np.sum(P / 4 * np.log(P / 4 / Q))
            print("Iteration ", (iter + 1), ": error is ", C)
        # Stop lying about P-values
        if iter == 100:
            P = P / 4
    print("finished training!")
    return y


if __name__ == "__main__":
    # Run Y = tsne.tsne(X, no_dims, perplexity) to perform t-SNE on your dataset.
    X, labels = load_digits(return_X_y=True)
    Y = tsne(X, 2, 50, 20.0)
    from matplotlib import pyplot as plt

    plt.scatter(Y[:, 0], Y[:, 1], 20, labels)
    plt.show()

运行之后的结果：

4.特点

主要用于可视化，很难用于其他目的。基于前面的学习，我们不难看出 t-SNE 只能由高维数据集产生低维数据集，这种关系是多对多的。也就是说，如果额外添加一个数据，那么 t-SNE 是不能像 PCA 那样给出新数据在低维空间下的坐标的。
过于高维一般不使用，当数据维数过高时，两个矩阵的计算量是很大的。所以一般来说，我们会先用 PCA 降维到 10 维左右，再使用 t-SNE 降维到 2 或 3 维空间进行可视化。
t-SNE倾向于保存局部特征，对于本征维数(intrinsic dimensionality)本身就很高的数据集，是不可能完整的映射到2-3维的空间