Machine Learning: Evaluating Classification Model Performance (1): The Confusion Matrix and Its Visualization

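That said, in some cases people do follow everyday habit when deciding which class to call positive and which negative, and even then "positive" does not mean upbeat or better. In cancer screening, for example, the two classes are "positive" and "negative" test results. Labelling "positive" as the positive class and "negative" as the negative class is the natural choice, yet you certainly cannot claim that a "positive" result is the better outcome. As another example, in "cat-or-not" classification the two classes are "cat" and "not cat". Labelling "cat" as the positive class and "not cat" as the negative class is again the natural choice, although nothing in the logic requires it.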

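Of course, the question of how to assign the positive and negative labels, and the ambiguity their literal meaning can cause, is specific to binary classification. It does not arise in multi-class problems.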

3. The confusion matrix in the multi-class case

A multi-class confusion matrix looks like the one in the figure below:

Figure 2 Confusion matrix in the multi-class case

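The sum of each row is the number of samples whose true class is that class, #Ck_gt, and the sum of each column is the number of samples predicted as that class, #Ck_pred.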

The diagonal elements count correctly classified samples; every off-diagonal element counts misclassified samples.
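The row sums, column sums, and diagonal described above can be checked on a small example. The following is a minimal sketch in plain Python; the helper name build_confusion_matrix and the label lists are hypothetical and are not part of the original article.

```python
def build_confusion_matrix(y_true, y_pred, n_classes):
    """Return an n_classes x n_classes matrix: rows = true class, columns = predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical ground truth and predictions for a 3-class problem.
y_true = [0, 0, 1, 2, 2, 0, 2, 0, 1]
y_pred = [0, 1, 1, 2, 0, 0, 2, 0, 1]

cm = build_confusion_matrix(y_true, y_pred, n_classes=3)

# Row k sums to the number of samples whose true class is k (#Ck_gt).
row_sums = [sum(row) for row in cm]
# Column k sums to the number of samples predicted as class k (#Ck_pred).
col_sums = [sum(col) for col in zip(*cm)]
# The diagonal holds the correctly classified samples.
correct = sum(cm[k][k] for k in range(3))
errors = len(y_true) - correct

for row in cm:
    print(row)
print("row sums:", row_sums)
print("column sums:", col_sums)
print("correct:", correct, "errors:", errors)
```

Here class 0 has four true samples but also four predicted samples, while classes 1 and 2 differ between the two counts, which is exactly the information the row and column sums expose.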

4. Visualizing the confusion matrix

4.1 sklearn confusion_matrix() and plot_confusion_matrix()

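The sklearn.metrics package provides confusion_matrix(), which builds a confusion matrix from the predictions and the ground-truth labels. A second function, plot_confusion_matrix(), draws the confusion matrix directly as a figure.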
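As a quick illustration of confusion_matrix() on its own, here is a minimal sketch with hypothetical binary labels (the label lists are made up for this example). For a binary problem with labels=[0, 1], ravel() unpacks the 2x2 matrix in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels and predictions (1 = positive, 0 = negative).
y_true = [0, 1, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes, both in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Flatten the 2x2 matrix row by row: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```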

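The code below first uses make_classification() to create a toy binary classification dataset, then instantiates a support vector machine classifier, trains it, and uses it for prediction. It then calls plot_confusion_matrix() to draw the confusion matrix of that classifier on the test split of the dataset.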

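# Example 1: Using sklearn plot_confusion_matrix
# Ref: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_confusion_matrix.html
# Note: requires scikit-learn < 1.2, where plot_confusion_matrix is still available.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import plot_confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Build a toy binary classification dataset and split it into train and test sets.
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a support vector machine classifier on the training split.
clf = SVC(random_state=0)
clf.fit(X_train, y_train)

# Predict on the test split and draw the resulting confusion matrix.
plot_confusion_matrix(clf, X_test, y_test)
plt.show()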

Running it produces the following result:

Figure 3 Example confusion matrix drawn with plot_confusion_matrix()

4.2 seaborn heatmap()

Ref: https://www.stackvidhya.com/plot-confusion-matrix-in-python-and-why/

Figures drawn with the heatmap function of the seaborn library look somewhat nicer.

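The main parameters of seaborn's heatmap() are listed below (data is required; the others are optional and control the appearance of the figure. See the seaborn heatmap() documentation for the remaining parameters):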

data – A rectangular dataset that can be coerced into a 2D array. Pass the confusion matrix you have already computed here.
annot=True – Writes the data value into each cell of the drawn matrix. Off by default.
cmap='Blues' – The name of the matplotlib colormap to use.

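heatmap() returns a matplotlib Axes, which can be stored in a variable so that the figure can be adjusted afterwards, for example by setting the title, the x-axis and y-axis labels, and the tick labels of both axes. Note that you can also pass an existing Axes through the ax parameter of heatmap(), in which case the heatmap is drawn on that Axes, as in the example below.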

Title – Labels the whole figure. Use the set_title() method to set it.
Axes labels – Name the x-axis and the y-axis. Use set_xlabel() to set the x-axis label and set_ylabel() to set the y-axis label.
Tick labels – Name the individual positions along each axis. Pass the tick labels as a list in ascending class order, because the rows and columns of the confusion matrix are arranged in ascending label order. Use xaxis.set_ticklabels() to set the tick labels of the x-axis and yaxis.set_ticklabels() to set those of the y-axis.

Finally, call plt.show() to display the figure.
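The title, axis-label, and tick-label calls described above can be sketched on a small hand-written matrix. The counts and the class names 'not cat' and 'cat' below are hypothetical, chosen only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Hypothetical 2x2 confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[12, 1],
               [2, 10]])

fig, ax = plt.subplots()

# annot=True writes each count into its cell; fmt='d' prints it as an integer.
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax)

ax.set_title('Confusion matrix (illustrative counts)')
ax.set_xlabel('Predicted class')
ax.set_ylabel('True class')

# Tick labels follow the ascending label order of the matrix (0 first, then 1).
ax.xaxis.set_ticklabels(['not cat', 'cat'])
ax.yaxis.set_ticklabels(['not cat', 'cat'])
```

Calling plt.show() afterwards displays the figure.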

# Example 2: Using seaborn heatmap
# Ref: https://www.stackvidhya.com/plot-confusion-matrix-in-python-and-why/
# Reuses clf, X_test, and y_test from Example 1.
import seaborn as sns
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

print('Example2: Using seaborn heatmap for confusion matrix visualization')
sns.set()  # apply seaborn's default plot theme
f, ax = plt.subplots()

# Compute the confusion matrix from the test-set predictions.
y_pred = clf.predict(X_test)
C2 = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(C2)

# Draw the heatmap on the existing Axes, writing each count into its cell.
sns.heatmap(C2, annot=True, ax=ax)

ax.set_title('Seaborn Confusion Matrix with labels\n\n')
ax.set_xlabel('predict')  # columns: predicted class
ax.set_ylabel('true')     # rows: true class
plt.show()

Running it produces the figure below: