大家好,今天讲一下数据分析中的因子分析。因子分析是主成分分析的推广和发展,是将具有错综复杂关系的变量综合为少数几个因子,以再现原始变量与因子之间的相互关系;根据不同的因子还可以对变量进行分类,也属于多元分析中降维处理的一种统计方法。

例如,一个学生的英语、数学、语文成绩都很好,那么潜在的共性因子可能是智力水平高。因此,因子分析的过程其实就是寻找共性因子和个性因子并得到最优解释的过程。

一、参数估计

1.1 主成分法

实例:下表所示为各参赛队男子径赛运动记录的部分数据,8项径赛运动分别是100m(x1)、200m(x2)、400m(x3)、800m(x4)、1500m(x5)、5000m(x6)、10000m(x7)、马拉松(x8),x1~x3的单位为秒,x4~x8的单位为分。使用这个数据,在前m个特征值的比重大于90%的标准下求主成分解。

import numpy as np
# 读取数据
sport = np.loadtxt(r'D:/data/sport.csv',delimiter = ',', 
                   skiprows = 1,usecols = (1,2,3,4,5,6,7,8))
# 计算相关矩阵Correlation Matrix
cor_mat = np.corrcoef(sport.T)
print('相关矩阵:\n', cor_mat)
# 将cor_mat以csv格式存储
np.savetxt('D:/data/cor_mat.csv', cor_mat, delimiter = ',')

# 计算特征值和特征向量
eig_value_pcm, eig_vector_pcm = np.linalg.eig(cor_mat)
# 把特征值从大到小排序
eig_pcm = eig_value_pcm[np.argsort(-eig_value_pcm)]
print('排序后的特征值:\n', eig_pcm)
# 确定公共因子的个数m,这里使用前m个特征值的比重大于90%的标准
for m in range(1,len(eig_pcm)):
    if eig_pcm[:m].sum()/eig_pcm.sum() >= 0.9:
        print('特征值的比重大于90%时,','m={}'.format(m))
        break
# 计算因子载荷阵A
a1 = np.sqrt(eig_value_pcm[0])*eig_vector_pcm[:,0]
a2 = np.sqrt(eig_value_pcm[1])*eig_vector_pcm[:,1]
A_pcm = np.vstack((a1,a2)).T    
A_pcm = np.mat(A_pcm)
print('因子载荷矩阵:\n', A_pcm)
# 计算特殊因子方差σ^2和共性方差h^2的估计
h_pcm = np.zeros(8)        # 共性方差
D_pcm = np.mat(np.eye(8))  # 特殊因子方差
for i in range(8):
    a = A_pcm[i,:]*A_pcm[i,:].T    
    h_pcm[i] = a[0,0]
    D_pcm[i,i] = 1-h_pcm[i]
print('共性方差:\n', h_pcm)
print('特殊因子方差:\n', np.diag(D_pcm))
# 计算残差矩阵
cancha_mat_pcm = cor_mat - A_pcm*A_pcm.T - D_pcm
# 将cancha_ma_pcmt以csv格式存储
np.savetxt('D:/data/cancha_mat_pcm.csv', cancha_mat_pcm, delimiter = ',')
print('残差矩阵:\n', cancha_mat_pcm)

输出结果:

相关矩阵:
 [[1.         0.92263844 0.84114677 0.75602776 0.70023818 0.61946177
  0.63253891 0.51994902]
 [0.92263844 1.         0.85072702 0.80662654 0.77495127 0.69537702
  0.6965391  0.59618369]
 [0.84114677 0.85072702 1.         0.8701714  0.83526937 0.77861385
  0.78720455 0.70499051]
 [0.75602776 0.80662654 0.8701714  1.         0.9180442  0.86359388
  0.86904892 0.80647642]
 [0.70023818 0.77495127 0.83526937 0.9180442  1.         0.92811404
  0.93469696 0.86554922]
 [0.61946177 0.69537702 0.77861385 0.86359388 0.92811404 1.
  0.97463538 0.9321884 ]
 [0.63253891 0.6965391  0.78720455 0.86904892 0.93469696 0.97463538
  1.         0.94317634]
 [0.51994902 0.59618369 0.70499051 0.80647642 0.86554922 0.9321884
  0.94317634 1.        ]]
排序后的特征值:
 [6.62214613 0.87761829 0.15932114 0.12404939 0.07988027 0.06796515
 0.04641953 0.0226001 ]
特征值的比重大于90%时, m=2
因子载荷矩阵:
 [[-0.81718497  0.53105812]
 [-0.86716652  0.43245706]
 [-0.91520118  0.23258563]
 [-0.94875449  0.01164452]
 [-0.95937139 -0.1309633 ]
 [-0.93766329 -0.29231413]
 [-0.94383533 -0.28747025]
 [-0.87989646 -0.41122586]]
共性方差:
 [0.949814   0.93899688 0.89168928 0.90027067 0.93754485 0.96466
 0.97346426 0.94332449]
特殊因子方差:
 [0.050186   0.06100312 0.10831072 0.09972933 0.06245515 0.03534
 0.02653574 0.05667551]
残差矩阵:
 [[ 0.         -0.01565684 -0.03025836 -0.02546407 -0.01419657  0.00845322
   0.01391428  0.01929569]
 [-0.01565684  0.         -0.0434881  -0.02113734 -0.00034748  0.00868012
   0.00239524  0.01100447]
 [-0.03025836 -0.0434881   0.         -0.00083818 -0.01228827 -0.01154863
  -0.00973321 -0.00464655]
 [-0.02546407 -0.02113734 -0.00083818  0.          0.00936129 -0.02261451
  -0.02307163 -0.02354077]
 [-0.01419657 -0.00034748 -0.01228827  0.00936129  0.         -0.00973572
  -0.0084397  -0.03245376]
 [ 0.00845322  0.00868012 -0.01154863 -0.02261451 -0.00973572  0.
   0.00560403 -0.01306534]
 [ 0.01391428  0.00239524 -0.00973321 -0.02307163 -0.0084397   0.00560403
   0.         -0.00551622]
 [ 0.01929569  0.01100447 -0.00464655 -0.02354077 -0.03245376 -0.01306534
  -0.00551622  0.        ]]

1.2 主因子法

import numpy as np
# 读取数据
sport = np.loadtxt(r'D:/data/sport.csv', delimiter = ',', skiprows = 1, 
                   usecols = (1,2,3,4,5,6,7,8))
# 计算相关矩阵Correlation Matrix
cor_mat = np.corrcoef(sport.T)
# 设置特殊方差σ^2的初始估计
D_estimated = np.diag(1/np.diag((np.mat(cor_mat).I)))
# 计算约相关矩阵R*
R_yue = cor_mat - D_estimated

# 计算特征值和特征向量
eig_value_fm, eig_vector_fm = np.linalg.eig(R_yue)
# 把特征值从大到小排序
eig_fm = eig_value_fm[np.argsort(-eig_value_fm)]
print('排序后的特征值:\n', eig_fm)

# 计算因子载荷阵A
a1 = np.sqrt(eig_value_fm[0])*eig_vector_fm[:,0]
a2 = np.sqrt(eig_value_fm[1])*eig_vector_fm[:,1]
A_fm = np.vstack((a1,a2)).T    
A_fm = np.mat(A_fm)
print('因子载荷矩阵:\n', A_fm)
# 计算特殊因子方差σ^2和共性方差h^2的估计
h_fm = np.zeros(8)        # 共性方差
D_fm = np.mat(np.eye(8))  # 特殊因子方差
for i in range(8):
    a = A_fm[i,:]*A_fm[i,:].T    
    h_fm[i] = a[0,0]
    D_fm[i,i] = 1-h_fm[i]
print('共性方差:\n', h_fm)
print('特殊因子方差:\n', np.diag(D_fm))
# 计算残差矩阵
cancha_mat_fm = cor_mat - A_fm * A_fm.T - D_fm
print('残差矩阵:\n', cancha_mat_fm)
# 将cancha_mat_fm以csv格式存储
np.savetxt('D:/data/cancha_mat_fm.csv', cancha_mat_fm, delimiter = ',')

结果输出:

排序后的特征值:
 [ 6.53017050e+00  7.77833570e-01  5.05910740e-02  6.12216657e-03
 -1.35498577e-02 -1.48507931e-02 -3.55553704e-02 -5.32171257e-02]
因子载荷矩阵:
 [[-0.80688051  0.49602464]
 [-0.85785026  0.41211834]
 [-0.89970476  0.21600757]
 [-0.93857683  0.02386917]
 [-0.95567614 -0.11432049]
 [-0.9383177  -0.28212239]
 [-0.94616758 -0.28119691]
 [-0.87396916 -0.37813843]]
共性方差:
 [0.8970966  0.90574859 0.85612793 0.8814962  0.92638605 0.96003314
 0.97430479 0.90681076]
特殊因子方差:
 [0.1029034  0.09425141 0.14387207 0.1185038  0.07361395 0.03996686
 0.02569521 0.09318924]
残差矩阵:
 [[ 0.          0.02603493  0.00804745 -0.0131313  -0.01417249  0.00229116
   0.00857533  0.00232632]
 [ 0.02603493  0.         -0.01010563 -0.00836876  0.00223782  0.00670875
   0.0007554   0.00228681]
 [ 0.00804745 -0.01010563  0.          0.02057343  0.00013709 -0.00465447
  -0.00332627  0.00035706]
 [-0.0131313  -0.00836876  0.02057343  0.          0.02379746 -0.01035534
  -0.01229011 -0.00478493]
 [-0.01417249  0.00223782  0.00013709  0.02379746  0.         -0.00086617
  -0.00167939 -0.01291122]
 [ 0.00229116  0.00670875 -0.00465447 -0.01035534 -0.00086617  0.
   0.00749765  0.00544636]
 [ 0.00857533  0.0007554  -0.00332627 -0.01229011 -0.00167939  0.00749765
   0.          0.0099237 ]
 [ 0.00232632  0.00228681  0.00035706 -0.00478493 -0.01291122  0.00544636
   0.0099237   0.        ]]
二、因子旋转

对上例中求得的因子载荷矩阵进行旋转,并分析旋转后的因子载荷矩阵。

# 定义方差最大法varimax函数
def varimax(Phi, gamma = 1.0, q = 20, tol = 1e-6):
    from scipy import eye, asarray, dot, sum
    from scipy.linalg import svd
    p,k = Phi.shape
    R = eye(k)
    d=0
    for i in range(q):
        d_old = d
        Lambda = dot(Phi, R)
        u,s,vh = svd(dot(Phi.T,asarray(Lambda)**3 - (gamma/p) * 
                         dot(Lambda, 
                             np.diag(np.diag(dot(Lambda.T,Lambda))))))
        R = dot(u,vh)
        d = sum(s)
        if d_old!=0 and d/d_old < 1 + tol: break
    return dot(Phi, R)
# 计算主成分法旋转后的因子载荷合计
A_xuanzhuan_pcm = varimax(A_pcm)
print('旋转后的主成分法求得的因子载荷矩阵:\n',A_xuanzhuan_pcm)
# 计算主成分法旋转后的因子载荷合计
A_xuanzhuan_fm = varimax(A_fm)
print('旋转后的主因子法求得的因子载荷矩阵:\n',A_xuanzhuan_fm)

输出结果:

旋转后的主成分法求得的因子载荷矩阵:
 [[-0.30789702  0.92466936]
 [-0.40844569  0.87873147]
 [-0.5706041   0.75239633]
 [-0.73457435  0.60055907]
 [-0.83177754  0.49567224]
 [-0.91539014  0.35597876]
 [-0.91719639  0.36361386]
 [-0.94435176  0.2269895 ]]
旋转后的主因子法求得的因子载荷矩阵:
 [[-0.32917733  0.88810973]
 [-0.42104951  0.85350214]
 [-0.57495318  0.72495295]
 [-0.72405917  0.597691  ]
 [-0.82275225  0.4994645 ]
 [-0.9125871   0.35667622]
 [-0.91819515  0.3622464 ]
 [-0.92115801  0.24140977]]

根据代码的结果可以知道,经过旋转后,因子有了较为明确的意义。主成分法和主因子法的因子载荷经过因子旋转之后给出了大致相同的结果在因子上的载荷依次增大,在因子f1*上的载荷依次减小,于是可以称f2*为耐力因子。

三、 因子得分

根据上例的数据,计算因子得分。

import scipy.stats as ss
# 样本数据标准化
sport_ss = np.array(ss.zscore(sport))
# 计算因子得分
defen = A_xuanzhuan_pcm.T * np.mat(cor_mat).I * np.mat(sport_ss).T
# 将defen以csv格式存储
np.savetxt('D:/data/defen.csv', defen, delimiter = ',')
print('因子得分:\n',defen.T)

输出结果:

因子得分:
 [[-0.31192231 -0.22647167]
 [ 0.60499561 -0.78018777]
 [ 0.57464796  0.21266768]
 [ 0.80186253 -0.27841048]
 [-1.41333334 -1.30761394]
 [ 0.04724742 -0.92103717]
 [-0.43193331  0.6987439 ]
 [ 0.19948942 -0.8484143 ]
 [-0.01840606 -0.26282464]
 [ 0.11486743  0.40135454]
 [ 0.45539014  0.3257845 ]
 [-2.22358216  3.85157949]
 [ 0.41771405  1.96873346]
 [ 0.40155914 -0.35981447]
 [ 0.601477    0.0540694 ]
 [-2.17056842 -1.64361289]
 [ 0.79341172 -0.06930847]
 [ 0.32827036 -0.95460124]
 [ 0.58560351 -0.89492431]
 [ 0.5065176  -0.97062287]
 [ 0.74006068 -0.97431125]
 [-0.28774727 -0.59970678]
 [ 0.03422973  1.7241543 ]
 [ 0.26780415 -0.4222868 ]
 [ 0.50497639  0.52941767]
 [-1.24767873  0.16372766]
 [ 0.90646886  0.58069509]
 [ 0.31957139  0.67422033]
 [ 0.17143796 -1.50049013]
 [ 0.65733692  0.04572758]
 [ 1.02333783 -0.0805589 ]
 [-0.26614473 -0.20324707]
 [ 0.53454584  1.72965915]
 [-0.25521496 -0.18138273]
 [-1.6807917  -1.03346258]
 [-0.85798761  1.602976  ]
 [ 0.77337281  0.54314692]
 [ 0.94871571  0.21141051]
 [ 1.11065842  0.38667441]
 [ 0.95845702  0.69164488]
 [-1.14129125  1.02779431]
 [-0.76203127  0.34671628]
 [ 0.30280777 -0.87878326]
 [ 1.1572367   0.87498085]
 [ 0.72335203  0.15455899]
 [-2.15885429 -0.74363949]
 [ 0.79427354  0.06478814]
 [ 0.50722501 -0.37239776]
 [ 0.63416998 -0.23566255]
 [-0.2632418   0.26778655]
 [-1.96829968 -0.73129511]
 [ 0.84722983  1.2300313 ]
 [ 0.30638316 -1.77387133]
 [ 0.2971525  -1.27984043]
 [-3.49482922  0.16573647]]

根据代码的结果可知,每个队伍两个因子的得分数值分别按因子得分f1、f2数值大小由高到低排序。按照f1进行排序反映了运动员的耐力由好到差,按照f1进行排序反映了运动员的速度由快到慢。‍

Logo

开放原子开发者工作坊旨在鼓励更多人参与开源活动,与志同道合的开发者们相互交流开发经验、分享开发心得、获取前沿技术趋势。工作坊有多种形式的开发者活动,如meetup、训练营等,主打技术交流,干货满满,真诚地邀请各位开发者共同参与!

更多推荐