Predicting bike rental counts with a multiple linear regression model

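This exercise builds a multiple linear regression model that predicts bike rental counts from the sharebike dataset of a Kaggle competition, and evaluates the model with metrics such as RMSE and MSE.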

Chenshuo_Xu


Chenshuo_Xu · published 2023-01-18 17:57:48

Linear regression exercise

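In this exercise we use a bike-sharing dataset provided by a Kaggle competition: Bike Sharing Demand.
The dataset contains the hourly bike rental counts recorded by the Capital Bikeshare system in 2011 and 2012, together with the corresponding season, weather and related information.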

Data columns:

datetime - hourly date + timestamp
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weather - 1: Clear, Few clouds, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - “feels like” temperature in Celsius
humidity - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated
count - number of total rentals

Step 1: Read in the data

# read the data and set the datetime as the index
import pandas as pd

bikes = pd.read_csv('bikeshare.csv', index_col='datetime', parse_dates=True)

bikes.head()

	season	holiday	workingday	weather	temp	atemp	humidity	windspeed	casual	registered	count
datetime
2011-01-01 00:00:00	1	0	0	1	9.84	14.395	81	0.0	3	13	16
2011-01-01 01:00:00	1	0	0	1	9.02	13.635	80	0.0	8	32	40
2011-01-01 02:00:00	1	0	0	1	9.02	13.635	80	0.0	5	27	32
2011-01-01 03:00:00	1	0	0	1	9.84	14.395	75	0.0	3	10	13
2011-01-01 04:00:00	1	0	0	1	9.84	14.395	75	0.0	0	1	1
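
As a small illustration of what index_col='datetime' together with parse_dates=True does, the sketch below reads the first three rows shown above from an in-memory string instead of bikeshare.csv. Only a subset of the columns is included, and the names csv_text and sample are introduced here for illustration.

```python
import io
import pandas as pd

# First rows of the table above, rewritten as comma-separated text (subset of columns).
csv_text = """datetime,season,temp,humidity,count
2011-01-01 00:00:00,1,9.84,81,16
2011-01-01 01:00:00,1,9.02,80,40
2011-01-01 02:00:00,1,9.02,80,32
"""

sample = pd.read_csv(io.StringIO(csv_text), index_col="datetime", parse_dates=True)

# With parse_dates=True the index is a DatetimeIndex, so time parts are directly available.
print(type(sample.index).__name__)   # DatetimeIndex
print(sample.index.hour.tolist())    # [0, 1, 2]
```

Having a DatetimeIndex makes it easy to derive features such as the hour of day later on, although this exercise does not use them.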

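Step 2: Visualize the data

Use matplotlib to draw a scatter plot of temperature "temp" against bike rental count "count";
Use seaborn to draw a scatter plot of "temp" against "count" with a fitted linear relationship (hint: use lmplot from seaborn).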

# matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(1, figsize = (10, 25))

tem = bikes["temp"]

cou = bikes["count"]

plt.ylabel("number of total rentals")

plt.xlabel("temperature in Celsius")

plt.scatter(tem, cou, s = 5)

<matplotlib.collections.PathCollection at 0x1970d87a340>

[Figure: scatter plot of count against temp]

# seaborn
import seaborn as sns

sns.lmplot(x = "temp", y = "count", data = bikes, height = 20, aspect = 1)

<seaborn.axisgrid.FacetGrid at 0x1970d056160>

[Figure: scatter plot of count against temp with a fitted regression line]

Step 3: Simple linear regression

Predict the bike rental count from temperature.

# create X and y
feature_cols0 = ["temp"]

X0 = bikes[feature_cols0]

y0 = bikes["count"]

# import, instantiate, fit
from sklearn.model_selection import train_test_split

X0_train, X0_test, y0_train, y0_test = train_test_split(X0, y0, random_state = 0)

from sklearn.linear_model import LinearRegression

linreg = LinearRegression()

linreg.fit(X0_train, y0_train)

LinearRegression()

# print the coefficients

list(zip(feature_cols0, linreg.coef_))

[('temp', 9.139291387753524)]
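
One way to read this coefficient: under this model, each additional degree Celsius is associated with about 9.14 more rentals in an hour. The sketch below shows the same property on synthetic data (not the bike data), where two predictions one unit apart differ by exactly the fitted coefficient. All names and numbers in it are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (not the bike data): y = 9 * x + 5 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 40, size=200).reshape(-1, 1)
y = 9.0 * x.ravel() + 5.0 + rng.normal(0, 1.0, size=200)

model = LinearRegression().fit(x, y)

# Predictions one unit apart differ by exactly the fitted coefficient.
pred_20, pred_21 = model.predict(np.array([[20.0], [21.0]]))
print(model.coef_[0], pred_21 - pred_20)
```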

Step 4: Explore multiple features

# explore more features
feature_cols = ['temp', 'season', 'weather', 'humidity']

# using seaborn, draw multiple scatter plots between each feature in feature_cols and 'count'
sns.pairplot(bikes, x_vars = feature_cols, y_vars = 'count', height = 7, aspect = 0.7)

<seaborn.axisgrid.PairGrid at 0x1970d0561f0>

[Figure: scatter plots of count against temp, season, weather and humidity]

# correlation matrix (ranges from 1 to -1)
bikes.corr()

	season	holiday	workingday	weather	temp	atemp	humidity	windspeed	casual	registered	count
season	1.000000	0.029368	-0.008126	0.008879	0.258689	0.264744	0.190610	-0.147121	0.096758	0.164011	0.163439
holiday	0.029368	1.000000	-0.250491	-0.007074	0.000295	-0.005215	0.001929	0.008409	0.043799	-0.020956	-0.005393
workingday	-0.008126	-0.250491	1.000000	0.033772	0.029966	0.024660	-0.010880	0.013373	-0.319111	0.119460	0.011594
weather	0.008879	-0.007074	0.033772	1.000000	-0.055035	-0.055376	0.406244	0.007261	-0.135918	-0.109340	-0.128655
temp	0.258689	0.000295	0.029966	-0.055035	1.000000	0.984948	-0.064949	-0.017852	0.467097	0.318571	0.394454
atemp	0.264744	-0.005215	0.024660	-0.055376	0.984948	1.000000	-0.043536	-0.057473	0.462067	0.314635	0.389784
humidity	0.190610	0.001929	-0.010880	0.406244	-0.064949	-0.043536	1.000000	-0.318607	-0.348187	-0.265458	-0.317371
windspeed	-0.147121	0.008409	0.013373	0.007261	-0.017852	-0.057473	-0.318607	1.000000	0.092276	0.091052	0.101369
casual	0.096758	0.043799	-0.319111	-0.135918	0.467097	0.462067	-0.348187	0.092276	1.000000	0.497250	0.690414
registered	0.164011	-0.020956	0.119460	-0.109340	0.318571	0.314635	-0.265458	0.091052	0.497250	1.000000	0.970948
count	0.163439	-0.005393	0.011594	-0.128655	0.394454	0.389784	-0.317371	0.101369	0.690414	0.970948	1.000000

sns.heatmap(bikes.corr())

<AxesSubplot:>

[Figure: heatmap of the correlation matrix]
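
In the matrix above, temp and atemp have a correlation of 0.984948, so they carry almost the same information. registered and count correlate at 0.970948, which is expected because count is the total that includes the registered rentals. A full matrix can be hard to scan, so one possible way to pull out just the correlations with the target is sketched below. It runs on a synthetic stand-in DataFrame (demo); the column names mirror the real ones only for readability.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the bike data; values are made up.
rng = np.random.default_rng(0)
temp = rng.uniform(0, 40, size=500)
humidity = rng.uniform(20, 100, size=500)
demo = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 1.0, size=500),   # nearly a copy of temp
    "humidity": humidity,
    "count": 8 * temp - 2 * humidity + rng.normal(0, 30, size=500),
})

corr = demo.corr()

# Correlation of every feature with the target, strongest (by absolute value) first.
target_corr = corr["count"].drop("count")
order = target_corr.abs().sort_values(ascending=False).index
print(target_corr.reindex(order))

# A nearly collinear feature pair shows up as a correlation close to 1.
print(corr.loc["temp", "atemp"])
```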

Predict the bike rental count 'count' from the four features 'temp', 'season', 'weather' and 'humidity'

# create X and y
X4 = bikes[feature_cols]

y4 = bikes['count']

# import, instantiate, fit
from sklearn.model_selection import train_test_split

X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y4, random_state=0)

from sklearn.linear_model import LinearRegression

linreg = LinearRegression()

linreg.fit(X4_train, y4_train)

LinearRegression()

# print the coefficients
list(zip(feature_cols, linreg.coef_))

[('temp', 7.801241497301735),
 ('season', 23.781793850855003),
 ('weather', 6.597649302354395),
 ('humidity', -3.120711730873758)]

Use a train/test split and RMSE to compare several different models

# compare different sets of features
feature_cols1 = ['temp', 'season', 'weather', 'humidity']
feature_cols2 = ['temp', 'season', 'weather']
feature_cols3 = ['temp', 'season', 'humidity']

import numpy as np

from sklearn import metrics

X1 = bikes[feature_cols1]
y1 = bikes['count']
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, random_state=0)
linreg1 = LinearRegression()
linreg1.fit(X1_train, y1_train)
y1_pred = linreg1.predict(X1_test)

print("RMSE1:", np.sqrt(metrics.mean_squared_error(y1_test,y1_pred)))

X2 = bikes[feature_cols2]
y2 = bikes['count']
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, random_state=0)
linreg2 = LinearRegression()
linreg2.fit(X2_train, y2_train)
y2_pred = linreg2.predict(X2_test)
print("RMSE2:", np.sqrt(metrics.mean_squared_error(y2_test,y2_pred)))

X3 = bikes[feature_cols3]
y3 = bikes['count']
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, random_state=0)
linreg3 = LinearRegression()
linreg3.fit(X3_train, y3_train)
y3_pred = linreg3.predict(X3_test)
print("RMSE3:", np.sqrt(metrics.mean_squared_error(y3_test,y3_pred)))

RMSE1: 155.56954726427747
RMSE2: 164.59320591359494
RMSE3: 155.62192767690075
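
The three comparisons above repeat the same split, fit, predict and score steps. A hypothetical helper could wrap them so that each feature set needs only one call. The name train_test_rmse and the synthetic demo data below are assumptions for illustration, not part of the original exercise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def train_test_rmse(df, feature_cols, target="count"):
    """Fit a linear regression on a train split and return the test-set RMSE."""
    X = df[feature_cols]
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # RMSE is the square root of the mean squared error (MSE).
    return np.sqrt(mean_squared_error(y_test, y_pred))


# Synthetic stand-in data; values are made up.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "temp": rng.uniform(0, 40, size=400),
    "humidity": rng.uniform(20, 100, size=400),
})
demo["count"] = 8 * demo["temp"] - 2 * demo["humidity"] + rng.normal(0, 30, size=400)

rmse_both = train_test_rmse(demo, ["temp", "humidity"])
rmse_temp = train_test_rmse(demo, ["temp"])
print(rmse_both, rmse_temp)
```

Because random_state=0 is fixed inside the helper, every feature set is scored on the same test rows, which keeps the RMSE values comparable.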

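Supplement: Handling categorical features

There are two kinds of categorical features:

Ordered categories: convert to corresponding numeric values (for example: small=1, medium=2, large=3)
Unordered categories: use dummy encoding (0/1 encoding)

The categorical features in this dataset are:

Ordered categories: weather (already encoded as the numeric values 1, 2, 3, 4)
Unordered categories: season (needs dummy encoding), holiday (already dummy encoded), workingday (already dummy encoded)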
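
The two encodings described above can be sketched on a toy table. The columns size and color are made up for illustration and do not exist in the bike data.

```python
import pandas as pd

# Toy data; "size" and "color" are made-up columns.
toy = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "blue", "green", "red"],
})

# Ordered category: map each level to a number that respects the order.
toy["size_code"] = toy["size"].map({"small": 1, "medium": 2, "large": 3})

# Unordered category: one 0/1 column per level.
color_dummies = pd.get_dummies(toy["color"], prefix="color", dtype=int)

print(toy["size_code"].tolist())       # [1, 3, 2, 1]
print(color_dummies.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
```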

# create dummy variables
season_dummies = pd.get_dummies(bikes.season, prefix='season')

# print 5 random rows
season_dummies.sample(n=5, random_state=1)

	season_1	season_2	season_3	season_4
datetime
2011-09-05 11:00:00	0	0	1	0
2012-03-18 04:00:00	1	0	0	0
2012-10-14 17:00:00	0	0	0	1
2011-04-04 15:00:00	0	1	0	0
2012-12-11 02:00:00	0	0	0	1

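We only need three dummy variables (not four): if all three are 0, the remaining one must be 1, so keeping three columns carries the same information as keeping four. We can therefore drop the first dummy variable.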

# drop the first column
season_dummies.drop(season_dummies.columns[0], axis=1, inplace=True)

# print 5 random rows
season_dummies.sample(n=5, random_state=1)

	season_2	season_3	season_4
datetime
2011-09-05 11:00:00	0	1	0
2012-03-18 04:00:00	0	0	0
2012-10-14 17:00:00	0	0	1
2011-04-04 15:00:00	1	0	0
2012-12-11 02:00:00	0	0	1
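
pandas can also drop the first level directly through the drop_first parameter of get_dummies. The sketch below checks, on a short made-up season series, that this shortcut gives the same columns as the manual drop.

```python
import pandas as pd

# Made-up season values for illustration.
season = pd.Series([3, 1, 4, 2, 4], name="season")

# Manual approach: build all four columns, then drop the first.
manual = pd.get_dummies(season, prefix="season", dtype=int)
manual = manual.drop(columns=manual.columns[0])

# Shortcut: let pandas drop the first level.
shortcut = pd.get_dummies(season, prefix="season", drop_first=True, dtype=int)

print(shortcut.columns.tolist())   # ['season_2', 'season_3', 'season_4']
print(manual.equals(shortcut))     # True
```

A row with season 1 is then encoded as all zeros, which makes season 1 the baseline that the other season coefficients are measured against.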

# concatenate the original DataFrame and the dummy DataFrame (axis=0 means rows, axis=1 means columns)
bikes = pd.concat([bikes, season_dummies], axis=1)

# print 5 random rows
bikes.sample(n=5, random_state=1)

	season	holiday	workingday	weather	temp	atemp	humidity	windspeed	casual	registered	count	season_2	season_3	season_4
datetime
2011-09-05 11:00:00	3	1	0	2	28.70	33.335	74	11.0014	101	207	308	0	1	0
2012-03-18 04:00:00	1	0	0	2	17.22	21.210	94	11.0014	6	8	14	0	0	0
2012-10-14 17:00:00	4	0	0	1	26.24	31.060	44	12.9980	193	346	539	0	0	1
2011-04-04 15:00:00	2	0	1	1	31.16	33.335	23	36.9974	47	96	143	1	0	0
2012-12-11 02:00:00	4	0	1	2	16.40	20.455	66	22.0028	0	1	1	0	0	1

Add the dummy variables to the regression model's features, predict the bike rental count, and compare with the previous models

# include dummy variables for season in the model
feature_cols = ['temp', 'season_2', 'season_3', 'season_4', 'humidity']

X5 = bikes[feature_cols]
y5 = bikes['count']
X5_train, X5_test, y5_train, y5_test = train_test_split(X5, y5, random_state=0)
linreg5 = LinearRegression()
linreg5.fit(X5_train, y5_train)
y5_pred = linreg5.predict(X5_test)

print("RMSE5:", np.sqrt(metrics.mean_squared_error(y5_test,y5_pred)))

RMSE5: 154.27800312889627

From the models above, pick the best one, add polynomial features (degree=2), then make predictions with Ridge, Lasso and ElasticNet models and compare them.

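# add polynomial features, degree=2
from sklearn.preprocessing import PolynomialFeatures

pf = PolynomialFeatures(degree=2)
X5_train_poly = pf.fit_transform(X5_train)
# reuse the transformer fitted on the training set instead of refitting on the test set
X5_test_poly = pf.transform(X5_test)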

# Ridge regression model
from sklearn.linear_model import Ridge

rr = Ridge(alpha=0.001)
rr.fit(X5_train_poly, y5_train)
y5_pred_rr = rr.predict(X5_test_poly)
print("Root mean square error (RMSE):", np.sqrt(metrics.mean_squared_error(y5_test,y5_pred_rr)))

Root mean square error (RMSE): 152.29954589083513

# Lasso regression model
from sklearn.linear_model import Lasso

lassor = Lasso(alpha=0.0001)
lassor.fit(X5_train_poly, y5_train)
y5_pred_lr = lassor.predict(X5_test_poly)
print("Root mean square error (RMSE):", np.sqrt(metrics.mean_squared_error(y5_test,y5_pred_lr)))

Root mean square error (RMSE): 152.29987641774662


C:\Users\86157\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:647: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 9.440e+07, tolerance: 2.680e+04
  model = cd_fast.enet_coordinate_descent(

# ElasticNet regression model
from sklearn.linear_model import ElasticNet

elasticnetr = ElasticNet(alpha=0.001)
elasticnetr.fit(X5_train_poly, y5_train)
y5_pred_er = elasticnetr.predict(X5_test_poly)
print("Root mean square error (RMSE):", np.sqrt(metrics.mean_squared_error(y5_test,y5_pred_er)))

Root mean square error (RMSE): 152.30721433139385


C:\Users\86157\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:647: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 9.443e+07, tolerance: 2.680e+04
  model = cd_fast.enet_coordinate_descent(
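
The ConvergenceWarning messages above suggest increasing the number of iterations, checking the scale of the features, or increasing regularisation. One possible way to follow that advice is sketched below: standardize the polynomial features in a pipeline and raise max_iter. The data is synthetic, and alpha=0.1 and max_iter=10000 are illustrative values rather than tuned settings, so the RMSE it prints is not comparable with the results above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with features on different scales; values are made up.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 40, 500), rng.uniform(20, 100, 500)])
y = 8 * X[:, 0] - 2 * X[:, 1] + 0.1 * X[:, 0] * X[:, 1] + rng.normal(0, 30, 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# include_bias=False: Lasso fits its own intercept, so the constant column is not needed.
lasso_pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),                  # put all polynomial features on a common scale
    Lasso(alpha=0.1, max_iter=10000),  # illustrative values, not tuned
)
lasso_pipe.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, lasso_pipe.predict(X_test)))
print(rmse)
```

The same pipeline shape should work with ElasticNet in place of Lasso. Because the scaler sits inside the pipeline, it is fitted on the training split only.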