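FB1 WARNING: This article does not explain how LightGBM works. It covers the important parameters (more of them, and newer ones, than most articles) and walks through demo cases. A good share of the text is quoted from the official English documentation: I was worried my own translation would color the meaning, so I excerpted the original instead.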


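(1) Some basic introduction (2) All visible sklearn parameters plus a few others (3) Most of the native-interface parameters, far more than the usual article covers. (I excerpted quite a lot; in practice you rarely need this many. Treat it as a set of notes, with some explanation and my own understanding added, to flip through when you forget something instead of wading through the official docs.)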



1. Histogram algorithm: histograms are not original to LightGBM, but using them really is much faster;












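Differences: predict in the native interface behaves slightly differently. For binary classification it returns the probability of label=1 directly, rather than the probabilities of both 0 and 1. A custom evaluation function must return three values, i.e. one extra boolean (is higher the better, True or False), otherwise an error is raised. The Parameters page of the official site does not list feval (the custom evaluation function), so from that page alone you would not know one can be supplied. It is an argument of the native train function rather than a key in the parameter dict: write your own evaluation function, pass it as feval, and it runs without any problem.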
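As a small illustration of the predict difference, the sketch below turns a vector of label=1 probabilities into the two-column layout and hard labels that the sklearn-style interface produces. The helper name `to_sklearn_style`, the 0.5 threshold and the scores are made up for the example; the scores merely stand in for what the native predict returns.

```python
import numpy as np

def to_sklearn_style(p_pos, threshold=0.5):
    """Turn P(label=1) into (two-column probabilities, hard labels)."""
    p_pos = np.asarray(p_pos, dtype=float)
    proba = np.column_stack([1.0 - p_pos, p_pos])  # columns: P(0), P(1)
    labels = (p_pos > threshold).astype(int)
    return proba, labels

# Made-up scores standing in for the output of the native predict
proba, labels = to_sklearn_style([0.1, 0.8, 0.55])
print(proba)
print(labels)
```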




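# Load the data and do the initial processing
from sklearn import datasets
from sklearn.model_selection import train_test_split

breast_cancer = datasets.load_breast_cancer(as_frame=True)
data = breast_cancer.data
target = breast_cancer.target
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=1024, stratify=target)

from lightgbm import LGBMClassifier
model_light = LGBMClassifier()  # default parameters; each one is explained below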

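Let's go through them in order. A black square (█) marks a parameter explanation. The article's structure was heavily reworked once, so I simply stopped numbering the parameters.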

█boosting, default =gbdt, options:gbdt,rf,dart ,aliases:boosting_type,boost



1 drop_rate , default = 0.1, type = double, aliases: rate_drop, constraints: 0.0 <= drop_rate <= 1.0
dropout rate: a fraction of previous trees to drop during the dropout

2 max_drop , default = 50, type = int
max number of dropped trees during one boosting iteration

3 skip_drop , default = 0.5, type = double, constraints: 0.0 <= skip_drop <= 1.0
probability of skipping the dropout procedure during a boosting iteration

4 xgboost_dart_mode , default = false, type = bool
set this to true, if you want to use xgboost dart mode

5 uniform_drop , default = false, type = bool
set this to true, if you want to use uniform drop

6 drop_seed , default = 4, type = int
random seed to choose dropping models

Note: internally, LightGBM uses gbdt mode for the first 1 / learning_rate iterations

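█num_leaves, default = 31. A tree limited to max_depth can have at most 2**max_depth leaves, so num_leaves should be set below that bound; with the depth already limited, this further guards against overfitting. LightGBM grows trees leaf-wise, always splitting the leaf with the largest split gain, so for the same total number of leaves its trees are usually much deeper than XGBoost's level-wise trees. That makes this one of the important parameters to tune.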
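The relationship between the two settings can be checked with a few lines of arithmetic. This is a sketch only: the helper names are made up, and it takes 2**max_depth as the bound, which is the figure the LightGBM warning message quoted further down uses.

```python
def max_leaves_for_depth(max_depth):
    """Upper bound on the leaves of a binary tree with at most max_depth levels of splits."""
    if max_depth <= 0:  # LightGBM treats <= 0 as "no limit"
        return None
    return 2 ** max_depth

def num_leaves_fits(num_leaves, max_depth):
    """True when num_leaves does not exceed what max_depth allows."""
    bound = max_leaves_for_depth(max_depth)
    return bound is None or num_leaves <= bound

print(max_leaves_for_depth(5))
print(num_leaves_fits(31, 5), num_leaves_fits(63, 5))
```

With max_depth = 5 the default num_leaves = 31 fits, while 63 could never be reached.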

█max_depth: maximum tree depth, default = -1, <= 0 means no limit. An important tuning parameter for tree models.


【Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves.】

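█learning_rate: the learning rate. 0.1 already counts as fairly large. Also an important parameter.

█n_estimators: int, optional (default=100). The number of trees, also called the number of boosting iterations. The sklearn constructor has no early_stop-style argument. In practice a run may print a warning that no further split gain is available, but that only means a tree could not be split any further; it is not early stopping on a validation metric (see early_stopping_round below for that).

█(less important) subsample_for_bin, int (default=200000). Every feature is bucketed into bins when the histograms are built; this parameter sets how many rows are sampled to construct those bins. It is not a cap on how many samples a single bin can hold.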

  • number of data that sampled to construct feature discrete bins

  • setting this to larger value will give better training result, but may increase data loading time

  • set this to larger value if data is very sparse

  • Note: don’t set this to small values, otherwise, you may encounter unexpected errors and poor accuracy

█objective, default = regression; options: regression, regression_l1, huber, fair, poisson, quantile, mape, gamma, tweedie, binary, multiclass, multiclassova, cross_entropy, cross_entropy_lambda, lambdarank, rank_xendcg; aliases: objective_type, app, application, loss

  • For regression: regression, which defaults to L2 loss, i.e. MSE.


  • For classification: binary, two classes; the labels must be in {0, 1};

  • For multi-class: multiclass; num_class should be set as well, i.e. you also have to tell the model how many classes there are.


  • For ranking: lambdarank, rank_xendcg

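█ class_weight: dict, 'balanced' or None, optional (default=None). Class weights. According to the docs, this parameter is meant for multi-class tasks. With 'balanced' the weights are computed automatically as n_samples / (n_classes * class_count). Take three classes with 10 samples of class 0, 90 of class 1 and 900 of class 2: the weights come out as 1000/(3*10) ≈ 33.3, 1000/(3*90) ≈ 3.7 and 1000/(3*900) ≈ 0.37, i.e. in the ratio 100 : 11.1 : 1.1. The weight multiplies the cross-entropy loss term, so as the iterations push the loss down, a term with a large coefficient in front of it gets priority. Once a class is weighted up, the model concentrates on finding every member of that class, at the risk of more false positives: great recall, collapsing precision. If you pass a dict, every class must be mapped and the keys must match the label values, e.g. {0: 100, 1: 11, 2: 1.1}. The default is None, where every class has weight 1.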
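To make the 'balanced' option concrete, here is a minimal sketch of the weight computation, assuming the scikit-learn convention n_samples / (n_classes * class_count). Simplified figures such as 1000/10 differ from this only by the constant factor n_classes, so the ratios between classes are the same. The helper name is made up for the example.

```python
import numpy as np

def balanced_class_weights(y):
    """Class weights under the 'balanced' heuristic: n_samples / (n_classes * count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# 10 samples of class 0, 90 of class 1, 900 of class 2
y = np.repeat([0, 1, 2], [10, 90, 900])
w = balanced_class_weights(y)
print(w)
```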

And for multi-class, don't forget ▲num_class, which tells the model how many classes there are.


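█ is_unbalance, default = false, type = bool. Whether the positive and negative classes are imbalanced; set it to true and the algorithm balances the weights automatically.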

  • used only in binary and multiclassova applications

  • set this to true if training data are unbalanced

  • Note: while enabling this should increase the overall performance metric of your model, it will also result in poor estimates of the individual class probabilities

  • Note: this parameter cannot be used at the same time with scale_pos_weight, choose only one of them

█scale_pos_weight, default = 1.0, type = double, constraints: scale_pos_weight > 0.0. How to set it: if the data contains 2000 samples of class 0 and 40 of class 1, use 50 (= number of negatives / number of positives).

  • used only in binary and multiclassova applications

  • weight of labels with positive class

  • Note: while enabling this should increase the overall performance metric of your model, it will also result in poor estimates of the individual class probabilities

  • Note: this parameter cannot be used at the same time with is_unbalance, choose only one of them
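The negatives / positives rule for scale_pos_weight is easy to compute from the labels. A minimal sketch, with a hypothetical helper name and a made-up label list matching the 2000 / 40 example:

```python
def scale_pos_weight_from_labels(labels):
    """Number of negatives divided by number of positives, for 0/1 labels."""
    pos = sum(1 for v in labels if v == 1)
    neg = sum(1 for v in labels if v == 0)
    if pos == 0:
        raise ValueError("no positive samples")
    return neg / pos

labels = [0] * 2000 + [1] * 40
params = {
    "objective": "binary",
    "scale_pos_weight": scale_pos_weight_from_labels(labels),  # do not combine with is_unbalance
}
print(params["scale_pos_weight"])
```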

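█min_split_gain, default 0, the minimal gain to perform split. Because boosting keeps fitting residuals during training, this parameter is not easy to control. If overfitting shows up, try setting it: start with a very small value and keep experimenting. It is the gamma term on the right-hand side of the original figure.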

█ min_child_weight , default = 1e-3, type = double,

minimal sum hessian in one leaf. Like min_data_in_leaf, it can be used to deal with over-fitting


█min_child_samples, default = 20, type = int. The minimum number of samples in a leaf

  • minimal number of data in one leaf. Can be used to deal with over-fitting.

  • Note: this is an approximation based on the Hessian, so occasionally you may observe splits which produce leaf nodes that have less than this many observations



Note: to enable bagging, bagging_freq should be set to a non zero value as well


█subsample_freq, default = 0, type = int, aliases: bagging_freq

0 means disable bagging; k means perform bagging at every k iteration. Every k-th iteration, LightGBM will randomly select bagging_fraction * 100 % of the data to use for the next k iterations.
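The "resample every k iterations" behaviour described in the quoted text can be sketched in plain NumPy. This only illustrates the schedule; it is not LightGBM's internal sampler, and the function name and seed are assumptions made for the example.

```python
import numpy as np

def bagging_schedule(n_rows, n_iters, bagging_fraction, bagging_freq, seed=3):
    """Row indices used at each iteration when a new bag is drawn every bagging_freq iterations."""
    rng = np.random.default_rng(seed)
    all_rows = np.arange(n_rows)
    current = all_rows  # bagging disabled: every row is used
    bags = []
    for it in range(n_iters):
        enabled = bagging_freq > 0 and bagging_fraction < 1.0
        if enabled and it % bagging_freq == 0:
            # draw a new bag without replacement and keep it for the next bagging_freq iterations
            current = rng.choice(all_rows, size=int(n_rows * bagging_fraction), replace=False)
        bags.append(current)
    return bags

bags = bagging_schedule(n_rows=100, n_iters=6, bagging_fraction=0.5, bagging_freq=3)
print([len(b) for b in bags])
```

Iterations 0 to 2 share one bag and iterations 3 to 5 share the next; with bagging_freq = 0 every iteration sees all rows.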


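█colsample_bytree: column sampling ratio, much like row sampling, type = double,

Comparing the sampling choices of 1. random forest 2. XGBoost 3. LightGBM
# Row sampling comparison

2. XGBoost: has subsample for row sampling, which samples without replacement (sampling with replacement is not possible; as in the example above, the 8000 sampled rows never repeat). There is also the sampling_method parameter for uniform versus gradient-based sampling.

3. LightGBM: also samples without replacement, with the extra parameter bagging_freq for how many iterations pass before the rows are resampled, plus separate sampling ratios for positive and negative samples.

# Column sampling comparison
1. Random forest: each split considers 0.6*n_feature_in columns, and the selection is redone before every split, so at every split each feature has a chance of being picked, after which its thresholds are scanned and the split gains compared.



# If this is unclear, go read the English original on the official site. Honestly, whether you follow it hardly matters; just use it.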

█reg_alpha: L1 regularization; if used it must be >= 0.

█reg_lambda: L2 regularization; if used it must be >= 0.


█n_jobs ,default =0 ,0 means default number of threads in OpenMP;


█importance_type: str, optional (default='split')


The type of feature importance to be filled into ``feature_importances_``. If 'split', result contains numbers of times the feature is used in a model. If 'gain', result contains total gains of splits which use the feature.

█**kwargs ,warning:\*\*kwargs is not supported in sklearn, it may cause unexpected issues.


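1.2 fit parameters

model_light.fit(x_train, y_train, eval_set=[(x_train, y_train), (x_test, y_test)])

█ x █ y: the training data to pass in

█(less important) sample_weight (default=None): works like class_weight. If both are set they are multiplied together; usually we just use class_weight.

█(less important) init_score: the initial score of every sample before the first tree. There is really no need to pass it, because after enough iterations the model corrects itself even if the starting values are wildly off. If you insist, its shape must match y_train.

█eval_set: pass only the test set, or both the training and test sets, so you can watch the loss fall as training proceeds.

█(less important) eval_names: gives names to the entries of eval_set; of little practical use.

█(less important) eval_sample_weight: weights for the evaluation sets, list of array (same types as ``sample_weight`` supports) (default=None)

█eval_class_weight: Class weights of eval data. Yet another layer; there really are a lot of weight parameters.

█(less important) eval_init_score: list of array (same types as ``init_score`` supports), or None, optional (default=None). Initial scores for the evaluation data; ignore it.



See the Parameters page of the LightGBM documentation and search for Metric Parameters

█(less important) feature_name: list of str, or 'auto', optional (default='auto'). If 'auto' and data is pandas DataFrame, data columns names are used. You can mostly ignore it.

█(highlight) categorical_feature: list of str or int, or 'auto', optional (default='auto')




In my view, only X, y, eval_set, eval_metric and init_model are used often.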

1.3 predict parameters

y_pred_label = model_light.predict(x_test)


█(less important) start_iteration: int, optional (default=0)

If <= 0, starts from the first iteration. Normally left alone.

█(less important) num_iteration: int or None, optional (default=None)

Total number of iterations used in the prediction. If None, if the best iteration exists and start_iteration <= 0, the best iteration is used; otherwise, all iterations from ``start_iteration`` are used (no limits). If <= 0, all iterations from ``start_iteration`` are used (no limits).


█(advanced) pred_leaf: bool, optional (default=False). If set to True, returns for each sample the index of the leaf it falls into in every tree, with shape = (len(x_test), num_iters). What is it good for?


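█(less important) pred_contrib: bool, optional (default=False). If set to True, returns the contribution of every feature to each prediction, as an array of shape (n_samples, n_features + 1), or (n_samples, (n_features + 1) * n_classes) for multi-class; the extra column is the expected value. You rarely need it; to learn more, look up SHAP values.

█(less important) validate_features: checks that the feature names of the prediction data exactly match the ones passed during fit. A different number of columns or a different name, even one extra space, raises an error. Not really recommended.

 1.4 predict_proba


1.5 A few important attributes:

model_light.feature_name_  # column names of the training data
model_light.feature_importances_  # importance of each feature column
# the two line up one-to-one


2.1 Converting to the Dataset format

# Convert to the Dataset format
import lightgbm as lgb
lgb_train = lgb.Dataset(x_train, y_train)
lgb_eval = lgb.Dataset(x_test, y_test, reference=lgb_train)

To keep it short: █data and █label need no explanation. For validation datasets the docs say you should generally add █reference=lgb_train, so add it.


█ group, █ position: only used for ranking problems;

█ init_score: initial scores; makes no real difference, not recommended;

█ feature_name, █ categorical_feature: already covered under fit above;

█ free_raw_data: bool, optional (default=True). Leave it alone; the raw data is released once the Dataset has been constructed.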




# Parameter dict
params = {
    # Core parameters
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'metric': ['auc', 'binary_error'],
    'learning_rate': 0.05,
    'num_leaves': 63,
    'nthread': -1,
    # Learning control parameters
    'min_sum_hessian_in_leaf': 1e-3,  # native name of min_child_weight: one setting, two names
    'max_delta_step': 0,  # left at 0; hard to control
    'feature_contri': None,  # no per-feature gain control
    'path_smooth': 2,  # leaf-weight smoothing
    'verbosity': 0,
    # IO parameters
}

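# Custom evaluation function
import numpy as np

def accuracy1(preds, train_data):
    labels = train_data.get_label()
    # With the built-in binary objective, preds are already probabilities,
    # so no extra sigmoid is applied (that is only needed with a custom objective).
    return 'accuracy', np.mean(labels == (preds > 0.5)), True
# Train + predict
model = lgb.train(params, train_set=lgb_train, num_boost_round=200,
                  valid_sets=[lgb_eval], feval=accuracy1)
y_ = model.predict(x_test)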



Core Parameters:



█data_sample_strategy , default = bagging, options: bagging, goss


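▲bagging is only effective when bagging_freq > 0 (resample rows every few trees) and bagging_fraction < 1.0 (fraction of rows drawn without replacement). bagging is the default. What sets it apart is the resampling interval: random forest and XGBoost both draw a row sample by ratio for every single tree.

▲goss, Gradient-based One-Side Sampling. This is the well-known one-sided gradient sampling; the name LightGBM owes a lot to GOSS. With goss you cannot use row sampling or the row-sampling interval, but column sampling still works.


▲top_rate: default = 0.2, type = double

Used only with goss: the fraction of large-gradient data to keep

▲other_rate: default = 0.1, type = double

Used only with goss: the fraction of small-gradient data to keep

For example, with 1,000,000 samples in total, keep the top 20% by gradient magnitude and sample 10% as small-gradient data. On its own this would shrink the contribution of the small-gradient part to the information gain. The fix is to multiply the sampled small-gradient rows by (1 - a) / b, where a = top_rate and b = other_rate: (1 - 20%) / 0.1 = 8, so each sampled small-gradient row counts 8 times, which approximately restores the original amount of information.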
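The sampling and reweighting can be sketched in NumPy. This follows the description of GOSS as a plain illustration and is not LightGBM's internal implementation; the function name, the seed and the random gradients are assumptions made for the example.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (selected row indices, per-row weights) under one-sided gradient sampling."""
    g = np.asarray(gradients, dtype=float)
    n = len(g)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    order = np.argsort(-np.abs(g))  # largest |gradient| first
    top_idx = order[:n_top]         # always kept, weight 1
    rng = np.random.default_rng(seed)
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate  # amplify the sampled small-gradient rows
    return np.concatenate([top_idx, other_idx]), weights

grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx), w[0], w[-1])
```

The weights sum to roughly the original row count (200 + 100 * 8 = 1000 here), which is the sense in which the information is approximately restored.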

█num_iterations, default = 100. Number of iterations. lgb.train() also has num_boost_round for the same thing; in my tests num_iterations in the parameter dict takes priority, so if both are set the dict value is used.


█(unimportant) tree_learner, default = serial. Whether to train distributed across several machines; on a single machine you can ignore it.

  • serial, single machine tree learner (the default)

  • feature, feature parallel tree learner, aliases: feature_parallel

  • data, data parallel tree learner, aliases: data_parallel

  • voting, voting parallel tree learner, aliases: voting_parallel

  • refer to Distributed Learning Guide to get more details

█device_type, default = cpu, options: cpu, gpu, cuda (CPU or graphics card; LightGBM is fast enough that a CPU usually does the job)

  • cpu supports all LightGBM functionality and is portable across the widest range of operating systems and hardware

  • cuda offers faster training than gpu or cpu, but only works on GPUs supporting CUDA

  • gpu can be faster than cpu and works on a wider range of GPUs than CUDA


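█(unimportant) deterministic, default = false, type = bool. Aims to keep results stable for the same data and the same parameters. The gist of the docs is that you can leave it alone, and setting this to true may slow down the training.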

  • used only with cpu device type

  • setting this to true should ensure the stable results when using the same data and the same parameters (and different num_threads)

█ seed ,this seed is used to generate other seeds, e.g. data_random_seed, feature_fraction_seed, etc.


█num_threads, default = 0, same as n_jobs in sklearn, aliases: num_thread, nthread, nthreads, n_jobs

Learning Control Parameters

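█force_col_wise and force_row_wise: at most one of the two can be set to true. If both are left false, LightGBM tries both and keeps the faster one, so in an actual run one of them is always in effect. This is the histogram algorithm mentioned earlier, and histograms are always used.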

█force_col_wise , default = false, type = bool used only with cpu device type

set this to true to force col-wise histogram building

enabling this is recommended when (col-wise histogram scenarios):

  • the number of columns is large, or the total number of bins is large

  • num_threads is large, e.g. > 20

  • you want to reduce memory cost

Note: when both force_col_wise and force_row_wise are false, LightGBM will firstly try them both, and then use the faster one. To remove the overhead of testing set the faster one to true manually.


█force_row_wise,default =false, type = bool

used only with cpu device type

set this to true to force row-wise histogram building

enabling this is recommended when (row-wise histogram scenarios):

  • the number of data points is large, and the total number of bins is relatively small

  • num_threads is relatively small, e.g. <= 16

  • you want to use small bagging_fraction or goss sample strategy to speed up

  • Note: setting this to true will double the memory cost for Dataset object. If you have not enough memory, you can try setting force_col_wise=true

  • Note: when both force_col_wise and force_row_wise are false, LightGBM will firstly try them both, and then use the faster one. To remove the overhead of testing set the faster one to true manually

█(unimportant) histogram_pool_size, default = -1.0, max cache size in MB for historical histogram, < 0 means no limit. A memory cap.

█max_depth: maximum depth, same as in sklearn

█min_data_in_leaf, default = 20, type = int. Same as min_child_samples in sklearn: the minimum number of samples in a leaf


>>>[Warning] min_data_in_leaf is set=5, min_samples_leaf=0.001 will be ignored. Current value: min_data_in_leaf=5


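█min_sum_hessian_in_leaf, default = 1e-3, type = double. This is the native name of sklearn's min_child_weight (min_child_weight is one of its aliases), so it is the same setting under another name, not an extra one. It works on the same principle as min_data_in_leaf but bounds the sum of second derivatives (hessians) in a leaf instead of the sample count: for L2 regression each sample's hessian is 1, while for binary classification it is sigmoid * (1 - sigmoid). So one parameter controls the number of samples in a leaf directly and the other controls the hessian sum. I have only used it for classification and regression; it may play a different role in other tasks.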
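The hessian sum of a leaf is easy to compute by hand, which shows why the two limits differ. A minimal sketch with a made-up helper, assuming L2 loss for regression and log loss for binary classification:

```python
import numpy as np

def leaf_hessian_sum(raw_scores, objective="binary"):
    """Sum of second derivatives over the samples in a leaf."""
    s = np.asarray(raw_scores, dtype=float)
    if objective == "regression":  # L2 loss: hessian is 1 per sample
        return float(len(s))
    p = 1.0 / (1.0 + np.exp(-s))   # sigmoid of the raw score
    return float(np.sum(p * (1.0 - p)))

print(leaf_hessian_sum([0.0] * 4))                 # uncertain samples: 0.25 each
print(leaf_hessian_sum([0.0] * 4, "regression"))   # equals the sample count
print(leaf_hessian_sum([10.0] * 20))               # 20 very confident samples
```

For regression the hessian sum is just the sample count, but in binary classification a leaf with 20 very confident samples can fall below the 1e-3 default even though it satisfies min_data_in_leaf = 20.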


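█bagging_freq: same as subsample_freq in sklearn

█pos_bagging_fraction, █neg_bagging_fraction: sampling ratios for positive and negative samples, used only in binary classification to deal with class imbalance. Both default to 1, which means disabled. To use them you must set both, and bagging_freq as well.

used for imbalanced binary classification problem, will randomly sample #pos_samples * pos_bagging_fraction positive samples in bagging, will randomly sample #neg_samples * neg_bagging_fraction negative samples in bagging. In other words, for imbalanced binary classification you set the positive and negative sampling ratios directly; set both together, and the row sampling ratio bagging_fraction is then ignored.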

  • Note: if both pos_bagging_fraction and neg_bagging_fraction are set to 1.0, balanced bagging is disabled

  • Note: if balanced bagging is enabled, bagging_fraction will be ignored

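█bagging_seed, default = 3, random seed for bagging. Pins the row-sampling result; when comparing results across runs you often need to fix the seed.

█feature_fraction: column sampling ratio, same as colsample_bytree in sklearn

█feature_fraction_bynode: column sampling ratio per tree node, worth considering when there are very many feature columns. For example, with 10,000 feature columns and feature_fraction = 0.8, each tree works with 8,000 columns; if feature_fraction_bynode is also 0.8, each node then draws 0.8 of those, i.e. 6,400 columns. The two ratios are multiplied once per node; they do not compound level by level. The official docs specifically note that, unlike column sampling per tree, this does not speed up training.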

Note: unlike feature_fraction, this cannot speed up training

Note: if both feature_fraction and feature_fraction_bynode are smaller than 1.0, the final fraction of each node is feature_fraction * feature_fraction_bynode
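The rule in the note above is a single multiplication per node, as this arithmetic sketch shows. The helper name is made up, and rounding to whole columns is an assumption made for the example, not LightGBM's exact rounding.

```python
def features_per_node(n_features, feature_fraction=1.0, feature_fraction_bynode=1.0):
    """Columns available to a tree, and to each node of that tree."""
    per_tree = round(n_features * feature_fraction)
    per_node = round(per_tree * feature_fraction_bynode)
    return per_tree, per_node

per_tree, per_node = features_per_node(10_000, 0.8, 0.8)
print(per_tree, per_node)
```

The final fraction at every node is feature_fraction * feature_fraction_bynode = 0.64, whatever the depth of the node.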

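█feature_fraction_seed, default = 2, random seed for feature_fraction. Pins the column sampling.

█(unimportant, can be ignored) extra_trees, default = false, type = bool. Whether to enable extremely randomized trees mode, which does not sound very clever: if set to true, when evaluating node splits LightGBM will check only one randomly-chosen threshold for each feature. Not recommended. Extremely randomized trees pick one split point (threshold) completely at random for each feature, compare the split gains, and take the best of that weak field as the split condition; presumably the same happens here. There is also extra_seed, random seed for selecting thresholds when extra_trees is true.

█early_stopping_round, default = 0, type = int, will stop training if one metric of one validation data doesn’t improve in last early_stopping_round rounds. Important, and generally worth setting: make the number of iterations large first, and this acts as a break mechanism. (Many articles from two or three years ago pass it as .train(early_..=20); that has since changed. In my tests it has to go in the parameter dict, otherwise an error is raised.) aliases: early_stopping_rounds, early_stopping, n_iter_no_change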

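█first_metric_only, default = false, type = bool. Used together with early_stopping_round. If there are several evaluation metrics (the metric entry in the parameter dict), setting it to True means only the first one is used for early stopping. By default training stops as soon as any one metric in the list stops improving. Turning this on keeps the model focused on the result you actually care about;

█max_delta_step, default = 0.0, type = double. Limits the maximum output of the leaves of the tree built in each iteration; the final max output of leaves is learning_rate * max_delta_step. Because it is hard to control, it is usually not among the first parameters to try.


█lambda_l2: L2 regularization; if used it must be >= 0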




█min_data_per_group , default = 100, type = int, constraints: min_data_per_group > 0 minimal number of data per categorical group

█max_cat_threshold,default =32, type = int.

█cat_l2,default =10.0, type = double,constraints:cat_l2>=0.0,

  • used for the categorical features

  • L2 regularization in categorical split

█cat_smooth,default =10.0,type = double, constraints:cat_smooth>=0.0,used for the categorical features.

this can reduce the effect of noises in categorical features, especially for categories with few data.

█max_cat_to_onehot, default = 4, type = int, constraints: max_cat_to_onehot > 0, when number of categories of one feature smaller than or equal to max_cat_to_onehot, one-vs-other split algorithm will be used


█interaction_constraints, default ="", type =string

  • controls which features can appear in the same branch

  • by default interaction constraints are disabled, to enable them you can specify

  • for Python-package, list of lists, e.g. [[0, 1, 2], [2, 3]]

  • any two features can only appear in the same branch only if there exists a constraint containing both features
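The last rule in the list can be checked mechanically. A minimal sketch with a hypothetical helper, using the [[0, 1, 2], [2, 3]] example from the docs:

```python
def may_share_branch(constraints, f1, f2):
    """True if some constraint group contains both feature indices."""
    return any(f1 in group and f2 in group for group in constraints)

constraints = [[0, 1, 2], [2, 3]]
print(may_share_branch(constraints, 0, 1))  # both in the first group
print(may_share_branch(constraints, 2, 3))  # both in the second group
print(may_share_branch(constraints, 0, 3))  # never listed together
```

Feature 2 sits in both groups, so it can follow either 0 and 1 or 3, while 0 and 3 can never appear in the same branch.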

Figure source: https://zhuanlan.zhihu.com/p/99069186 

█verbosity, default = 1, type = int, aliases: verbose. < 0: Fatal, = 0: Error (Warning), = 1: Info, > 1: Debug


█(probably rarely needed) feature_contri, default = None, type = multi-double, aliases: feature_contrib, fc, fp, feature_penalty

  • used to control feature’s split gain, will use gain[i] = max(0, feature_contri[i]) * gain[i] to replace the split gain of i-th feature

  • you need to specify all features in order
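The gain replacement quoted above, gain[i] = max(0, feature_contri[i]) * gain[i], can be written out directly. The function name and the numbers are made up for illustration.

```python
def penalized_gains(gains, feature_contri):
    """Apply gain[i] = max(0, feature_contri[i]) * gain[i] to every feature, in order."""
    if len(gains) != len(feature_contri):
        raise ValueError("feature_contri must list every feature, in order")
    return [max(0.0, c) * g for c, g in zip(feature_contri, gains)]

raw_gains = [10.0, 4.0, 6.0]
adjusted = penalized_gains(raw_gains, [1.0, 0.5, -1.0])
print(adjusted)
```

A factor of 1 leaves a feature untouched, 0.5 halves its split gain, and anything at or below 0 removes its gain entirely.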

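█path_smooth, default = 0, type = double, constraints: path_smooth >= 0.0. This looks like a brand-new parameter for controlling leaf weights. LightGBM uses the same leaf-value formula as XGBoost, w = -G / (H + lambda), where G and H are the sums of gradients and hessians in the leaf; path_smooth adjusts that value as follows: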

  • controls smoothing applied to tree nodes

  • helps prevent overfitting on leaves with few samples

  • if set to zero, no smoothing is applied

  • if path_smooth > 0 then min_data_in_leaf must be at least 2

  • larger values give stronger regularization

  • the weight of each node is w * (n / path_smooth) / (n / path_smooth + 1) + w_p / (n / path_smooth + 1), where n is the number of samples in the node, w is the optimal node weight to minimise the loss (approximately -sum_gradients / sum_hessians), and w_p is the weight of the parent node

  • note that the parent output w_p itself has smoothing applied, unless it is the root node, so that the smoothing effect accumulates with the tree depth
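The smoothing formula quoted in the list is easier to read as code. A minimal sketch; the function name and the sample numbers are made up for the example.

```python
def smoothed_weight(w, w_parent, n, path_smooth):
    """w * (n / s) / (n / s + 1) + w_parent / (n / s + 1), with s = path_smooth."""
    if path_smooth <= 0:
        return w  # zero means no smoothing
    k = n / path_smooth
    return w * k / (k + 1) + w_parent / (k + 1)

# A leaf with few samples is pulled strongly toward its parent...
small = smoothed_weight(w=1.0, w_parent=0.0, n=2, path_smooth=2)
# ...while a leaf with many samples keeps almost all of its own weight.
large = smoothed_weight(w=1.0, w_parent=0.0, n=200, path_smooth=2)
print(small, large)
```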

█use_quantized_grad, default = false, type = bool. Whether to train with quantized (discretized) gradients.

  • whether to use gradient quantization when training

  • enabling this will discretize (quantize) the gradients and hessians into bins of num_grad_quant_bins

  • with quantized training, most arithmetics in the training process will be integer operations

  • gradient quantization can accelerate training, with little accuracy drop in most cases

  • Note: can be used only with device_type = cpu

  • New in version 4.0.0


█num_grad_quant_bins, default = 4, type = int. The number of bins used to quantize the gradients.

  • number of bins to quantization gradients and hessians

  • with more bins, the quantized training will be closer to full precision training

  • Note: can be used only with device_type = cpu

  • New in 4.0.0


█quant_train_renew_leaf, default = false, type = bool,

  • whether to renew the leaf values with original gradients when quantized training

  • renewing is very helpful for good quantized training accuracy for ranking objectives

  • Note: can be used only with device_type=cpu


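█stochastic_rounding, default = true, type = bool. Whether to round stochastically when gradient quantization is enabled. In general, stochastic rounding rounds a value up or down at random, with the probability of rounding up equal to its fractional part, so the result is unbiased on average; it is not a random pick among ordinary rounding, floor and ceil. It is on by default, so leave it alone.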

whether to use stochastic rounding in gradient quantization
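For readers unfamiliar with the term, here is the generic definition of stochastic rounding in NumPy. It illustrates the idea only; how LightGBM applies it internally to quantized gradients is not shown here, and the function name is made up.

```python
import numpy as np

def stochastic_round(x, rng):
    """Round down or up at random; the chance of rounding up equals the fractional part."""
    x = np.asarray(x, dtype=float)
    low = np.floor(x)
    frac = x - low
    return low + (rng.random(x.shape) < frac)

rng = np.random.default_rng(0)
vals = stochastic_round(np.full(100_000, 2.3), rng)
print(np.unique(vals), vals.mean())
```

Every value becomes 2 or 3, yet the average stays close to 2.3, which ordinary rounding (always 2) would not preserve.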

IO Parameters

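█max_bin, default = 255. The maximum number of bins in a feature's histogram

  • max number of bins that feature values will be bucketed in

  • small number of bins may reduce training accuracy but may increase general power (deal with over-fitting)

█max_bin_by_feature, default = None. As I understand it, you pass a list that gives the maximum number of bins for each feature. Since LightGBM is typically used with many features and many samples, this is probably rarely needed.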

  • max number of bins for each feature

  • if not specified, will use max_bin for all features

█min_data_in_bin, default =3

  • minimal number of data inside one bin

  • use this to avoid one-data-one-bin (potential over-fitting)

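█bin_construct_sample_cnt, default = 200000, type = int. Same as subsample_for_bin in sklearn: the number of rows sampled to construct the bins.

█data_random_seed, default = 1, type = int, aliases: data_seed. The seed for the sampling used to build the histogram bins; pin it when comparing results.

  • random seed for sampling data to construct histogram bins

█is_enable_sparse, default = true. Whether to optimize for sparse features; a lot of high-dimensional data consists mostly of 0/1 values. The docs do not explain how the optimization works, and it is not the exclusive feature bundling algorithm. It is on by default, so leave it on.

█enable_bundle, default = true. This is the well-known Exclusive Feature Bundling (EFB); evidently the defaults already switch on the high-performance fast mode.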

Note: disabling this may cause the slow training speed for sparse datasets

█feature_pre_filter,default =true, type = bool

  • set this to true (the default) to tell LightGBM to ignore the features that are unsplittable based on min_data_in_leaf

  • as dataset object is initialized only once and cannot be changed after that, you may need to set this to false when searching parameters with min_data_in_leaf, otherwise features are filtered by min_data_in_leaf firstly if you don’t reconstruct dataset object

  • Note: setting this to false may slow down the training

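  • As the docs explain, this is a pre-filter. Say min_data_in_leaf is the default 20 and, among several million rows, a few features have only 19 zeros with everything else equal to 1: any split on such a feature would leave fewer than 20 rows on one side, so it can be treated like outliers or a data collection error and filtered out up front. It is also a reminder to turn this parameter off when tuning min_data_in_leaf, because the features are filtered when the Dataset is built.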

█two_round ,default =false,

  • set this to true if data file is too big to fit in memory

  • by default, LightGBM will map data file to memory and load features from memory. This will provide faster data loading speed, but may cause run out of memory error when the data file is very big

  • Note: works only in case of loading data directly from text file


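█categorical_feature: one of LightGBM's strengths is that it handles categorical features well, and you can specify them manually. If there are many, this snippet is handy: data.select_dtypes('object').columns.tolist() collects the column names into a list to pass in. Note that the columns themselves must be integer-coded or of pandas category dtype before training (object columns need astype('category') first); as far as input handling goes, this is roughly a built-in LabelEncoder().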

  • used to specify categorical features

  • use number for index, e.g. categorical_feature=0,1,2 means column_0, column_1 and column_2 are categorical features

  • add a prefix name: for column name, e.g. categorical_feature=name:c1,c2,c3 means c1, c2 and c3 are categorical features

  • Note: all values will be cast to int32 (integer codes will be extracted from pandas categoricals in the Python-package)

  • Note: index starts from 0 and it doesn’t count the label column when passing type is int

  • Note: all values should be less than Int32.MaxValue (2147483647)

  • Note: using large values could be memory consuming. Tree decision rule works best when categorical features are presented by consecutive integers starting from zero

  • Note: all negative values will be treated as missing values

  • Note: the output cannot be monotonically constrained with respect to a categorical feature

  • Note: floating point numbers in categorical features will be rounded towards 0
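Collecting the categorical column names and preparing the columns can be sketched with pandas. The toy frame and its column names are invented for the example; the columns are cast to category dtype because, per the notes above, the Python package extracts integer codes from pandas categoricals.

```python
import pandas as pd

# Toy frame; the text columns are stored as object dtype explicitly
df = pd.DataFrame({
    "city": pd.Series(["a", "b", "a", "c"], dtype=object),
    "plan": pd.Series(["x", "y", "x", "x"], dtype=object),
    "age": [31, 45, 27, 52],
})

cat_cols = df.select_dtypes("object").columns.tolist()
for col in cat_cols:
    df[col] = df[col].astype("category")  # categories get consecutive integer codes from 0

city_codes = df["city"].cat.codes.tolist()
print(cat_cols, city_codes)
```

The list `cat_cols` is what would be passed as categorical_feature, and the codes are consecutive integers starting from zero, which is the form the tree decision rule works best with.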

█precise_float_parser, default = false, type = bool

  • use precise floating point number parsing for text parser (e.g. CSV, TSV, LibSVM input)

  • Note: setting this to true may lead to much slower text parsing


Objective Parameters

█ num_class: for multi-class, set it in the parameter dict;

█ is_unbalance █ scale_pos_weight: covered above under sklearn


█sigmoid,default =1.0, type = double, constraints:sigmoid>0.0 ,used only in binary and multiclassova classification and in lambdarank applications


█boost_from_average,default =true, type = bool

used only in regression, binary, multiclassova and cross-entropy applications


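█reg_sqrt, default = false, type = bool. (Does not feel very clever.) It is similar in spirit to taking the log of the labels before training and applying exp to the predictions afterwards, except that it uses a square root. Worth trying when the label values in a regression task are very large. My understanding is that a skewed distribution can be corrected this way, but usually we transform the data by hand before feeding the model and invert the predictions by hand afterwards, and taking a log additionally requires values > 0;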

  • used only in regression application

  • used to fit sqrt(label) instead of original values and prediction result will be also automatically converted to prediction^2

  • might be useful in case of large-range labels
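The transform described in the list amounts to a square root on the way in and a square on the way out. A tiny sketch with made-up labels:

```python
import math

labels = [4.0, 10000.0, 250000.0]

fit_targets = [math.sqrt(v) for v in labels]  # what the model is fitted on
restored = [t ** 2 for t in fit_targets]      # predictions are squared on the way out

print(fit_targets)
print(restored)
```

The targets span 2 to 500 instead of 4 to 250000, which is the point of the option for large-range labels.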


Metric Parameters





Network Parameters

GPU Parameters

Predict Parameters

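2.3 model.predict parameters


█pred_early_stop, default = false, type = bool. Despite the name, this is not about the early stopping used in training. As the notes below say, it stops accumulating trees early during prediction to speed it up, and may affect the accuracy. Whether predict uses the best iteration is governed by num_iteration: when it is None and a best iteration exists, the best iteration is used (see num_iteration under 1.3), so there is nothing extra to set by hand.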

  • used only in prediction task

  • used only in classification and ranking applications

  • used only for predicting normal or raw scores

  • if true, will use early-stopping to speed up the prediction. May affect the accuracy

  • Note: cannot be used with rf boosting type or custom objective function


2.4 补充

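LightGBM has two things you may occasionally use: from lightgbm import plot_importance, plot_tree. These built-in helpers take some parameters of their own; I only mention them here and did not customize the figures much: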




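# With CV, pass all the data directly; no train/test split
import pandas as pd

full_datasets = lgb.Dataset(data=data, label=target)
# Same parameter dict as the native interface, with early_stop and num_iterations (=300, higher priority, would override) commented out; everything else unchanged
light_cv = lgb.cv(params, full_datasets, num_boost_round=200, nfold=5,
                  metrics=['auc', 'binary_logloss'], feval=accuracy1)  # added a custom accuracy for good measure
cv_df = pd.DataFrame(light_cv)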

lightgbm.cv — LightGBM documentation

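Going by the official explanation, it is almost the same as XGBoost's CV. You usually set an early_stop so that it can find the best number of iterations, but that is not very important, because other tuning approaches such as grid search, random search and Bayesian optimization can replace it entirely.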



