XGBoost parameter tuning: a concise guide

小媛在努力 · Published 2020-04-28 21:50:36

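I. Commonly tuned parameters

1. max_depth [default 6]

Maximum depth of a tree; values between 3 and 10 are typical.

Deeper trees overfit more easily, because a deeper model learns patterns that are specific to small, local groups of samples.

Deeper trees also use more memory and take longer to train, because XGBoost keeps splitting until it reaches max_depth and only then goes back to prune.

2. eta [default 0.3]

Learning rate; values between 0.01 and 0.5 are typical.

If it is too large, accuracy suffers and training struggles to converge (the updates can oscillate around the optimum without settling).

If it is too small, training is slow.

Rule of thumb: 1 <= learning_rate * num_round <= 10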
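The rule of thumb above can be turned into a range of round counts for a given learning rate. The following is a minimal sketch; the helper name `num_round_range` and its `low`/`high` arguments are made up for illustration and are not part of XGBoost.

```python
import math


def num_round_range(learning_rate, low=1.0, high=10.0):
    """Return (min_rounds, max_rounds) with low <= learning_rate * rounds <= high.

    Hypothetical helper that only encodes the rule of thumb from the text.
    """
    if learning_rate <= 0:
        raise ValueError("learning_rate must be positive")
    min_rounds = math.ceil(low / learning_rate)
    max_rounds = math.floor(high / learning_rate)
    return min_rounds, max_rounds


for eta in (0.5, 0.3):
    print(eta, num_round_range(eta))
```

A smaller eta pushes both ends of the range up, which matches the advice to raise num_round whenever eta is lowered.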

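3. lambda [default 1]

L2 regularization term on the weights

4. alpha [default 0]

L1 regularization term on the weights

5. min_child_weight [default 1]

Minimum sum of instance weights in a leaf (the sum of the second-order derivatives of the loss over the samples in the leaf)

Larger values reduce overfitting

Values that are too high cause underfitting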
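To make "sum of second-order derivatives" concrete: for binary logistic loss the second derivative for one sample is p * (1 - p), where p is the predicted probability. The sketch below computes that sum for two made-up leaves; the margins and the threshold of 0.5 are illustrative assumptions, not values from the original post.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hessian_sum_logistic(margins):
    """Sum of p * (1 - p) over the samples in a leaf (binary logistic loss)."""
    total = 0.0
    for margin in margins:
        p = sigmoid(margin)
        total += p * (1.0 - p)
    return total


# Made-up raw scores (margins) for the samples that fall into two leaves
uncertain_leaf = [0.0, 0.1, -0.2, 0.05]   # predictions near 0.5
confident_leaf = [4.0, 5.0, 4.5, 6.0]     # predictions near 1.0

min_child_weight = 0.5  # illustrative threshold
for name, leaf in (("uncertain", uncertain_leaf), ("confident", confident_leaf)):
    h = hessian_sum_logistic(leaf)
    print(name, round(h, 4), "allowed" if h >= min_child_weight else "blocked")
```

Both leaves hold four samples, but the confident one has a much smaller Hessian sum, so a split that creates it would be rejected first. For squared-error loss each sample contributes 1, so the parameter then reduces to a minimum sample count.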

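6. subsample [default 1]

Fraction of the training rows randomly sampled for each tree; values between 0.5 and 1 are typical.

Lowering it makes the algorithm more conservative and helps against overfitting to some degree, but values that are too small tend to underfit.

7. colsample_bytree [default 1]

Fraction of columns (one column per feature) randomly sampled for each tree; values between 0.5 and 1 are typical.

8. gamma [default 0]

Minimum loss reduction required to split a node

The larger the value, the more conservative the algorithm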
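The effect of gamma is easiest to see in the split-gain formula from the XGBoost paper, where gamma is subtracted from the gain of every candidate split. The sketch below is a simplified version that ignores the L1 term alpha; the gradient and Hessian sums are made-up numbers.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of a split given gradient sums (g) and Hessian sums (h) per child.

    Simplified sketch: L2 regularization only, no alpha.
    """
    def score(g, h):
        return g * g / (h + reg_lambda)

    gain = 0.5 * (
        score(g_left, h_left)
        + score(g_right, h_right)
        - score(g_left + g_right, h_left + h_right)
    )
    return gain - gamma


# Made-up statistics for one candidate split
print(split_gain(-4.0, 2.0, 3.0, 2.0))             # positive: split is kept
print(split_gain(-4.0, 2.0, 3.0, 2.0, gamma=5.0))  # negative: split is pruned
```

A split survives only when the returned value is positive, so raising gamma removes the weakest splits first.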

9. scale_pos_weight [default 1]

Sets the weight of positive samples to balance the positive and negative classes

A typical value to consider: sum(negative instances) / sum(positive instances)

For an example, see https://github.com/dmlc/xgboost/blob/master/demo/kaggle-higgs/higgs-numpy.py
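The typical value quoted above is just the ratio of class counts. A minimal sketch, assuming labels are encoded as 0 (negative) and 1 (positive); the helper name is hypothetical.

```python
def suggested_scale_pos_weight(labels):
    """Return sum(negative instances) / sum(positive instances) for 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive samples in labels")
    return negatives / positives


# Made-up imbalanced label vector: 95 negatives, 5 positives
labels = [0] * 95 + [1] * 5
print(suggested_scale_pos_weight(labels))  # 19.0
```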

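10. `tree_method` [default `auto`]

Tree construction algorithm; the options are auto, exact, approx, hist and gpu_hist. It is usually changed to adjust training speed.

auto: heuristically picks the fastest algorithm, which in practice means exact greedy for small datasets and approx for large ones. For large datasets it is worth trying hist or gpu_hist for better performance; gpu_hist supports external memory.

exact: exact greedy algorithm that enumerates all split candidates

approx: approximate greedy algorithm using quantile sketch and gradient histogram

hist: a faster, histogram-optimized variant of approx

gpu_hist: GPU implementation of the hist algorithm (this list describes XGBoost 1.x; from 2.0 on, gpu_hist is deprecated in favour of tree_method='hist' combined with device='cuda')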
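The speed difference between exact and the histogram-based methods comes from how many split thresholds each one has to evaluate per feature. The sketch below is a rough illustration only: it uses plain, unweighted quantiles, whereas XGBoost uses a weighted quantile sketch, and the function names are invented for this example. `max_bin` mirrors the XGBoost parameter of the same name.

```python
def exact_candidates(values):
    """Midpoints between consecutive distinct values: what exact greedy enumerates."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def histogram_candidates(values, max_bin):
    """At most max_bin - 1 cut points taken at evenly spaced quantiles (simplified)."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = {ordered[(i * n) // max_bin] for i in range(1, max_bin)}
    return sorted(cuts)


feature = list(range(1000))  # made-up feature with 1000 distinct values
print(len(exact_candidates(feature)))          # one candidate per gap between values
print(len(histogram_candidates(feature, 16)))  # bounded by max_bin, not by data size
```

The number of exact candidates grows with the data, while the histogram candidates stay capped at max_bin - 1 per feature.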

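11. num_boost_round: number of boosting iterations

12. early_stopping_rounds

During training, checks whether the validation metric has improved within the last n rounds; if it has not, training stops.

When early stopping is actually triggered, the returned model carries three attributes: best_score, best_iteration and best_ntree_limit, where best_ntree_limit is the number of trees in the best model.

However, the documentation and the source code state: "The method returns the model from the last iteration (not the best one)." In other words, when early stopping triggers, the returned model is the one from the last round rather than the best one, which looks like a trap.

Further testing showed, though, that evaluation metrics such as rmse matched the best model, not the last-round model. The source code explains why: when predict is called without an explicit ntree_limit, XGBoost defaults it to best_ntree_limit, so prediction still uses the best model. (This is the behaviour of the XGBoost release available when this was written; later releases replace ntree_limit with iteration_range.)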
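The gap between "best iteration" and "last iteration" described above can be reproduced without XGBoost. The following is a minimal sketch of the stopping rule, assuming a metric where lower is better unless `maximize=True`; the function name and the score list are made up.

```python
def early_stopping(scores, patience, maximize=False):
    """Return (best_iteration, last_iteration) for a list of per-round scores.

    Training stops once the score has not improved for `patience` rounds.
    """
    best_score = None
    best_iteration = -1
    for i, s in enumerate(scores):
        improved = best_score is None or (
            s > best_score if maximize else s < best_score
        )
        if improved:
            best_score, best_iteration = s, i
        elif i - best_iteration >= patience:
            return best_iteration, i
    return best_iteration, len(scores) - 1


# Made-up validation rmse per boosting round
rmse = [0.50, 0.42, 0.40, 0.41, 0.43, 0.44, 0.39]
best_it, last_it = early_stopping(rmse, patience=3)
print(best_it, last_it)  # 2 5
```

Training stops at round 5, three rounds after the best round 2, so the returned model contains three extra trees; limiting prediction to the first best_iteration + 1 trees recovers the best model.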

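II. Tuning directions and goals

1. Overfitting

Adjust the parameters that directly control model complexity
- max_depth --> lower
- min_child_weight --> raise
- gamma --> raise
Add randomness so that training is robust to noise
- subsample --> lower
- colsample_bytree --> lower
- eta and num_round --> lower eta, raise num_round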
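The directions listed above can be written down as one adjustment step on a parameter dictionary. This is only a sketch: the helper `make_more_conservative` and all step sizes are arbitrary choices for illustration, and in practice each parameter would be tuned against a validation set.

```python
def make_more_conservative(params, num_round):
    """Return a copy of params moved one step against overfitting, plus new num_round.

    Step sizes are arbitrary illustrative values.
    """
    p = dict(params)
    # Reduce model complexity
    p["max_depth"] = max(1, p["max_depth"] - 2)
    p["min_child_weight"] = p["min_child_weight"] + 2
    p["gamma"] = p["gamma"] + 0.1
    # Add randomness
    p["subsample"] = max(0.5, p["subsample"] - 0.2)
    p["colsample_bytree"] = max(0.5, p["colsample_bytree"] - 0.2)
    # Halve the learning rate and double the number of rounds
    p["eta"] = p["eta"] / 2
    return p, num_round * 2


baseline = {
    "max_depth": 6,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
    "eta": 0.3,
}
tuned, rounds = make_more_conservative(baseline, num_round=100)
print(tuned, rounds)
```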

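2. Training speed

Set tree_method to hist or gpu_hist to speed up computation

3. Imbalanced positive and negative samples

If you care about the overall ranking quality (AUC)
- set the positive-class weight scale_pos_weight
- use AUC as the evaluation metric
If you care about predicting the right probability (in that case the dataset should not be re-balanced)
- set max_delta_step to a value between 1 and 10, which helps convergence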

III. Tuning methods

Common ways to tune the important parameters of ensemble learners:

1. sklearn.model_selection.GridSearchCV (grid search)

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import GridSearchCV


# Load the training data
traindata = pd.read_csv("/traindata_4_3.txt", sep=',')
traindata = traindata.set_index('instance_id')
trainlabel = traindata['is_trade']
del traindata['is_trade']
print(traindata.shape, trainlabel.shape)


# Use an XGBoost classifier
clf1 = xgb.XGBClassifier()

# Search ranges for the six main XGBoost parameters
param_dist = {
    'n_estimators': range(80, 200, 4),
    'max_depth': range(2, 15, 1),
    'learning_rate': np.linspace(0.01, 2, 20),
    'subsample': np.linspace(0.7, 0.9, 20),
    'colsample_bytree': np.linspace(0.5, 0.98, 10),
    'min_child_weight': range(1, 9, 1)
}


# GridSearchCV arguments:
# clf1: the estimator to train
# param_dist: dict holding the parameter search ranges
# scoring='neg_log_loss': evaluate with negative log loss
# n_jobs=-1: use all CPUs for training (a value of 1 uses a single CPU)
# GridSearchCV tries every combination, so it takes no n_iter argument
grid = GridSearchCV(clf1, param_dist, cv=3, scoring='neg_log_loss', n_jobs=-1)

# Fit on the training set
grid.fit(traindata.values, np.ravel(trainlabel.values))
# Best estimator found
best_estimator = grid.best_estimator_
print(best_estimator)
# Score of the best estimator
print(grid.best_score_)
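Because grid search fits every combination, it helps to count the combinations before launching it. The sketch below counts them for the grid in the listing above, using the lengths of the ranges and the number of points in each np.linspace call; only the standard library is needed.

```python
import math

# Number of candidate values per parameter in the grid from the listing
grid_sizes = {
    "n_estimators": len(range(80, 200, 4)),
    "max_depth": len(range(2, 15, 1)),
    "learning_rate": 20,      # np.linspace(0.01, 2, 20)
    "subsample": 20,          # np.linspace(0.7, 0.9, 20)
    "colsample_bytree": 10,   # np.linspace(0.5, 0.98, 10)
    "min_child_weight": len(range(1, 9, 1)),
}
cv_folds = 3

n_combinations = math.prod(grid_sizes.values())
n_fits = n_combinations * cv_folds
print(n_combinations, n_fits)  # 12480000 37440000
```

With more than 37 million model fits, this grid is not practical to run exhaustively; either the grid has to be made much coarser, or a sampling-based search such as RandomizedSearchCV has to be used.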



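# Custom log-loss function
# ==============================================================
import numpy as np
from sklearn.metrics import make_scorer

def logloss(act, pred):
    # Clip predictions away from 0 and 1 so the logarithms stay finite
    epsilon = 1e-15
    act = np.asarray(act)
    pred = np.clip(pred, epsilon, 1 - epsilon)
    ll = np.sum(act * np.log(pred) + (1 - act) * np.log(1 - pred))
    ll = ll * -1.0 / len(act)
    return ll

# greater_is_better tells scikit-learn whether a larger value of the custom metric is better
# Log loss is a loss, so use greater_is_better=False (the scorer then returns the negated value)
# To score predicted probabilities instead of labels, also pass needs_proba=True
# (older scikit-learn) or response_method='predict_proba' (scikit-learn 1.4+)
loss = make_scorer(logloss, greater_is_better=False)
# For a metric where larger is better, use greater_is_better=True
score = make_scorer(logloss, greater_is_better=True)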

2. sklearn.model_selection.RandomizedSearchCV (random search)

sklearn.model_selection.RandomizedSearchCV( estimator, param_distributions, n_iter=10, scoring=None, n_jobs=None, iid='deprecated', refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score=nan, return_train_score=False, )

RandomizedSearchCV is used in the same way as GridSearchCV, but it replaces the exhaustive grid with random sampling from the parameter space. For a continuous parameter, RandomizedSearchCV can sample from a distribution, which grid search cannot do. Its search power depends on n_iter: the larger the value, the better the chance of finding good parameters, but the longer the search takes.

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV


# Load the training data
traindata = pd.read_csv("/traindata.txt", sep=',')
traindata = traindata.set_index('instance_id')
trainlabel = traindata['is_trade']
del traindata['is_trade']
print(traindata.shape, trainlabel.shape)


# Use an XGBoost classifier
clf1 = xgb.XGBClassifier()

# Search ranges for the six main XGBoost parameters
param_dist = {
    'n_estimators': range(80, 200, 4),
    'max_depth': range(2, 15, 1),
    'learning_rate': np.linspace(0.01, 2, 20),
    'subsample': np.linspace(0.7, 0.9, 20),
    'colsample_bytree': np.linspace(0.5, 0.98, 10),
    'min_child_weight': range(1, 9, 1)
}

# RandomizedSearchCV arguments:
# clf1: the estimator to train
# param_dist: dict holding the parameter search ranges
# scoring='neg_log_loss': evaluate with negative log loss
# n_iter=300: sample 300 parameter settings; larger values search more thoroughly but take longer
# n_jobs=-1: use all CPUs for training (a value of 1 uses a single CPU)
grid = RandomizedSearchCV(clf1, param_dist, cv=3, scoring='neg_log_loss', n_iter=300, n_jobs=-1)

# Fit on the training set
grid.fit(traindata.values, np.ravel(trainlabel.values))
# Best estimator found
best_estimator = grid.best_estimator_
print(best_estimator)
# Score of the best estimator
print(grid.best_score_)
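What random search does internally can be sketched in a few lines of plain Python. This is not scikit-learn's implementation: the space encoding (a list means discrete candidates, a tuple means a continuous uniform range), the function names, and the toy objective are all assumptions made for this illustration.

```python
import random


def sample_params(space, rng):
    """Draw one parameter setting from the space."""
    params = {}
    for name, spec in space.items():
        if isinstance(spec, tuple):   # (low, high): continuous uniform range
            low, high = spec
            params[name] = rng.uniform(low, high)
        else:                         # list of discrete candidates
            params[name] = rng.choice(spec)
    return params


def random_search(space, objective, n_iter, seed=0):
    """Evaluate n_iter random settings and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        params = sample_params(space, rng)
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


space = {"max_depth": list(range(2, 15)), "learning_rate": (0.01, 0.5)}


def toy_objective(p):
    # Stand-in for a cross-validated loss; smallest at max_depth=6, learning_rate=0.1
    return (p["max_depth"] - 6) ** 2 + (p["learning_rate"] - 0.1) ** 2


best_params, best_score = random_search(space, toy_objective, n_iter=300)
print(best_params, best_score)
```

The cost is fixed by n_iter regardless of how large the space is, and continuous parameters such as learning_rate are sampled directly instead of being forced onto a grid.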

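3. hyperopt (Bayesian optimization)

Grid search is slow but covers the whole search space well; random search is fast but may miss important points in the search space.

hyperopt is a Python library for "distributed asynchronous hyperparameter optimization". It automates most of the tedious tuning work. Broadly speaking, the score of a model can be viewed as a non-convex function of its hyperparameters, so hyperopt usually finds more reasonable settings than manual tuning does. For models with many interacting parameters in particular, it tends to reach a better result far faster than tuning by hand.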

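import numpy as np
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.model_selection import KFold, cross_val_score
from xgboost import XGBClassifier


def hyperopt_eval_func(params, X, y):
    '''Fit the model defined by params on X and return the CV score.
    Args:
        params: model hyperparameters
        X: input features
        y: ground-truth labels
    Return:
        score: cross-validated loss (negated mean F1)
    '''
    # hp.quniform yields floats, but these parameters must be integers
    int_feat = ['n_estimators', 'max_depth', 'min_child_weight']
    for p in int_feat:
        params[p] = int(params[p])

    clf = XGBClassifier(**params)

    # Use the cross-validation result as the objective
    shuffle = KFold(n_splits=5, shuffle=True)
    score = -1 * cross_val_score(clf, X, y, scoring='f1', cv=shuffle).mean()

    return score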

def hyperopt_binary_model(params):
    '''hyperopt objective: wraps hyperopt_eval_func and adds some progress output.
    Args:
        params: hyperparameters sampled by hyperopt
    Return:
        loss_status: loss and status
    '''
    global best_loss, count, binary_X, binary_y
    count += 1

    # 'type' only labels the model; it is not an XGBoost parameter
    clf_type = params['type']
    del params['type']
    loss = hyperopt_eval_func(params, binary_X, binary_y)
    print(count, loss)
    if loss < best_loss:
        ss = 'count:%d new best loss: %4.3f , using %s' % (count, loss, clf_type)
        print(ss)
        best_loss = loss

    loss_status = {'loss': loss, 'status': STATUS_OK}
    return loss_status

def get_best_model(best):
    '''Build the model for the best hyperparameters found by hyperopt.
    Args:
        best: best hyperparameters
    Return:
        clf: xgb model
    '''
    int_feat = ['n_estimators', 'max_depth', 'min_child_weight']
    for p in int_feat:
        best[p] = int(best[p])

    # Fix the random state (random_state is the scikit-learn style name for the seed)
    best['random_state'] = 2018
    clf = XGBClassifier(**best)

    return clf


# Named search_best_model so that it does not shadow get_best_model(best) defined above
def search_best_model(X_train, y_train, predictors, max_evals_num=10):
    '''Use hyperopt to find the best xgb model.
    Args:
        X_train: training features
        y_train: training target
        predictors: features used for prediction (not used by this function)
        max_evals_num: number of hyperopt evaluations; more evaluations give a
            better model but take longer
    Return:
        clf: best model
    '''
    space = {
        'type': 'xgb',
        'n_estimators': hp.quniform('n_estimators', 50, 400, 50),  # 50 to 400 in steps of 50
        'max_depth': hp.quniform('max_depth', 2, 8, 1),
        # 'learning_rate': hp.loguniform('learning_rate', np.log(0.005), np.log(0.2)),
        'learning_rate': hp.uniform('learning_rate', 0.01, 0.1),
        'min_child_weight': hp.quniform('min_child_weight', 2, 8, 1),
        'gamma': hp.uniform('gamma', 0, 0.2),
        'subsample': hp.uniform('subsample', 0.7, 1.0),
        'colsample_bytree': hp.uniform('colsample_bytree', 0.7, 1.0)
    }

    # hyperopt train
    global count, best_loss, binary_X, binary_y
    count = 0
    best_loss = 1000000
    binary_X = X_train
    binary_y = y_train
    trials = Trials()
    best = fmin(hyperopt_binary_model, space, algo=tpe.suggest, max_evals=max_evals_num, trials=trials)
    print('best param:{}'.format(best))
    print('best cv loss (negated F1) on train:{}'.format(best_loss))

    clf = get_best_model(best)

    return clf

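Functions available for defining the parameter space:

hp.pchoice(label, p_options): returns one of the options in p_options with the given probabilities, so the options are not equally likely during the search.
hp.uniform(label, low, high): uniformly distributed between low and high.
hp.quniform(label, low, high, q): takes values round(uniform(low, high) / q) * q; suited to discrete values.
hp.loguniform(label, low, high): returns a value drawn as exp(uniform(low, high)), so that the logarithm of the return value is uniformly distributed.
During optimization the variable is constrained to the interval [exp(low), exp(high)].
hp.randint(label, upper): returns a random integer in the half-open interval [0, upper).
hp.normal(label, mu, sigma): normally distributed with mean mu and standard deviation sigma; the range of returned values is unbounded.
hp.qnormal(label, mu, sigma, q)
hp.lognormal(label, mu, sigma)
hp.qlognormal(label, mu, sigma, q)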
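The formulas given for hp.quniform and hp.loguniform can be checked with the standard library alone. The sketch below re-implements just those two formulas; it is not hyperopt's code, and the function names and the seed are arbitrary.

```python
import math
import random


def quniform(rng, low, high, q):
    """round(uniform(low, high) / q) * q, returned as a float like hyperopt does."""
    return float(round(rng.uniform(low, high) / q) * q)


def loguniform(rng, low, high):
    """exp(uniform(low, high)): the logarithm of the result is uniform."""
    return math.exp(rng.uniform(low, high))


rng = random.Random(42)

# Same ranges as in the search space above
n_estimators = [quniform(rng, 50, 400, 50) for _ in range(1000)]
learning_rates = [loguniform(rng, math.log(0.005), math.log(0.2)) for _ in range(1000)]

print(sorted(set(n_estimators)))
print(min(learning_rates), max(learning_rates))
```

The quniform samples fall only on multiples of 50, and they are floats, which is why the listings cast n_estimators, max_depth and min_child_weight with int() before building the model. The loguniform samples stay within [0.005, 0.2] and concentrate towards the low end of that interval.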

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from hyperopt import hp
from hyperopt.pyll.stochastic import sample

learning_rate = {'learning_rate': hp.loguniform('learning_rate', np.log(0.005), np.log(0.2))}

# Draw 10000 samples from the search space to visualize the distribution
learning_rate_dist = []
for _ in range(10000):
    learning_rate_dist.append(sample(learning_rate)['learning_rate'])

plt.figure(figsize=(8, 6))
# fill=True replaces the deprecated shade=True
sns.kdeplot(learning_rate_dist, color='r', linewidth=2, fill=True)
plt.title('Learning Rate Distribution', size=18)
plt.xlabel('Learning Rate', size=16)
plt.ylabel('Density', size=16)

[Figure: density plot of the sampled learning rates]

num_leaves = {'num_leaves': hp.quniform('num_leaves', 30, 150, 1)}
num_leaves_dist = []
for _ in range(10000):
    num_leaves_dist.append(sample(num_leaves)['num_leaves'])

plt.figure(figsize=(8, 6))
# fill=True replaces the deprecated shade=True
sns.kdeplot(num_leaves_dist, linewidth=2, fill=True)
plt.title('Number of Leaves Distribution', size=18)
plt.xlabel('Number of Leaves', size=16)
plt.ylabel('Density', size=16)

[Figure: density plot of the sampled number of leaves]