Predicting Bike Rental Counts with a Multiple Linear Regression Model
In this exercise we build a multiple linear regression model on the Bike Sharing Demand dataset from a Kaggle competition to predict bike rental counts, and evaluate the model with metrics such as RMSE and MSE.
Linear Regression Exercise
In this exercise we use a bike-sharing dataset provided in a Kaggle competition: Bike Sharing Demand.
The dataset contains hourly bike rental counts recorded by the Capital Bikeshare system during 2011-2012, together with the corresponding season and weather information.
Data columns:
- datetime - hourly date + timestamp
- season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
- holiday - whether the day is considered a holiday
- workingday - whether the day is neither a weekend nor holiday
- weather - 1: Clear, Few clouds, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp - temperature in Celsius
- atemp - “feels like” temperature in Celsius
- humidity - relative humidity
- windspeed - wind speed
- casual - number of non-registered user rentals initiated
- registered - number of registered user rentals initiated
- count - number of total rentals
Step 1: Load the data
# read the data and set the datetime as the index
import pandas as pd
bikes = pd.read_csv('bikeshare.csv', index_col='datetime', parse_dates=True)
bikes.head()
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
| 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
| 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
| 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 |
| 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 |
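As a quick sanity check (an added aside; the Kaggle training file is expected to hold 10,886 hourly records), the shape of the loaded frame can be confirmed:
# confirm row/column counts after loading
bikes.shape   # expected: (10886, 11) with datetime as the index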
Step 2: Visualize the data
- Use matplotlib to draw a scatter plot of temperature ("temp") against rental count ("count");
- Use seaborn to draw the same scatter plot with a fitted linear relationship (hint: use seaborn's lmplot).
# matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(1, figsize = (10, 25))
tem = bikes["temp"]
cou = bikes["count"]
plt.ylabel("number of total rentals")
plt.xlabel("temperature in Celsius")
plt.scatter(tem, cou, s = 5)
[scatter plot: temp vs. count]
# seaborn
import seaborn as sns
sns.lmplot(x = "temp", y = "count", data = bikes, height = 20, aspect = 1)
[seaborn lmplot: temp vs. count with fitted regression line]
Step 3: Simple linear regression
Predict the bike rental count from temperature alone.
# create X and y
feature_cols0 = ["temp"]
X0 = bikes[feature_cols0]
y0 = bikes["count"]
# import, instantiate, fit
from sklearn.model_selection import train_test_split
X0_train, X0_test, y0_train, y0_test = train_test_split(X0, y0, random_state = 0)
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X0_train, y0_train)
LinearRegression()
# print the coefficients
list(zip(feature_cols0, linreg.coef_))
[('temp', 9.139291387753524)]
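The original notebook never scores this single-feature baseline. As a minimal sketch using the same split and metric conventions as the later steps, its test RMSE can be computed like this:
# evaluate the temp-only model on the held-out split
import numpy as np
from sklearn import metrics
y0_pred = linreg.predict(X0_test)
print("RMSE (temp only):", np.sqrt(metrics.mean_squared_error(y0_test, y0_pred)))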
Step 4: Explore multiple features
# explore more features
feature_cols = ['temp', 'season', 'weather', 'humidity']
# using seaborn, draw multiple scatter plots between each feature in feature_cols and 'count'
sns.pairplot(bikes, x_vars = feature_cols, y_vars = 'count', height = 7, aspect = 0.7)
[pairplot: temp, season, weather, and humidity each vs. count]
# correlation matrix (values range from -1 to 1)
bikes.corr()
|  | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| season | 1.000000 | 0.029368 | -0.008126 | 0.008879 | 0.258689 | 0.264744 | 0.190610 | -0.147121 | 0.096758 | 0.164011 | 0.163439 |
| holiday | 0.029368 | 1.000000 | -0.250491 | -0.007074 | 0.000295 | -0.005215 | 0.001929 | 0.008409 | 0.043799 | -0.020956 | -0.005393 |
| workingday | -0.008126 | -0.250491 | 1.000000 | 0.033772 | 0.029966 | 0.024660 | -0.010880 | 0.013373 | -0.319111 | 0.119460 | 0.011594 |
| weather | 0.008879 | -0.007074 | 0.033772 | 1.000000 | -0.055035 | -0.055376 | 0.406244 | 0.007261 | -0.135918 | -0.109340 | -0.128655 |
| temp | 0.258689 | 0.000295 | 0.029966 | -0.055035 | 1.000000 | 0.984948 | -0.064949 | -0.017852 | 0.467097 | 0.318571 | 0.394454 |
| atemp | 0.264744 | -0.005215 | 0.024660 | -0.055376 | 0.984948 | 1.000000 | -0.043536 | -0.057473 | 0.462067 | 0.314635 | 0.389784 |
| humidity | 0.190610 | 0.001929 | -0.010880 | 0.406244 | -0.064949 | -0.043536 | 1.000000 | -0.318607 | -0.348187 | -0.265458 | -0.317371 |
| windspeed | -0.147121 | 0.008409 | 0.013373 | 0.007261 | -0.017852 | -0.057473 | -0.318607 | 1.000000 | 0.092276 | 0.091052 | 0.101369 |
| casual | 0.096758 | 0.043799 | -0.319111 | -0.135918 | 0.467097 | 0.462067 | -0.348187 | 0.092276 | 1.000000 | 0.497250 | 0.690414 |
| registered | 0.164011 | -0.020956 | 0.119460 | -0.109340 | 0.318571 | 0.314635 | -0.265458 | 0.091052 | 0.497250 | 1.000000 | 0.970948 |
| count | 0.163439 | -0.005393 | 0.011594 | -0.128655 | 0.394454 | 0.389784 | -0.317371 | 0.101369 | 0.690414 | 0.970948 | 1.000000 |
sns.heatmap(bikes.corr())
[heatmap of the correlation matrix]
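Two things stand out in the matrix: temp and atemp are nearly collinear (r ≈ 0.985), so including both would add little, and casual/registered correlate strongly with count simply because they sum to it, so they should not be used as predictors. A small sketch to rank the remaining candidates:
# rank features by correlation with 'count', excluding its own components
corr_with_count = bikes.corr()['count'].drop(['count', 'casual', 'registered'])
print(corr_with_count.sort_values(ascending=False))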
Predict the rental count 'count' from the four features 'temp', 'season', 'weather', and 'humidity'.
# create X and y
X4 = bikes[feature_cols]
y4 = bikes['count']
# import, instantiate, fit
from sklearn.model_selection import train_test_split
X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y4, random_state=0)
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X4_train, y4_train)
LinearRegression()
# print the coefficients
list(zip(feature_cols, linreg.coef_))
[('temp', 7.801241497301735),
('season', 23.781793850855003),
('weather', 6.597649302354395),
('humidity', -3.120711730873758)]
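Together with the intercept, these coefficients define the fitted linear equation. As a small illustrative check (the intercept is not printed in the original output), a prediction can be reproduced by hand:
# the model computes: count = intercept + 7.80*temp + 23.78*season + 6.60*weather - 3.12*humidity
print(linreg.intercept_)
row = X4_test.iloc[0]
manual = linreg.intercept_ + (linreg.coef_ * row.values).sum()
print(manual, linreg.predict(X4_test.iloc[[0]])[0])   # the two values should match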
Use a train/test split and RMSE to compare several different models
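For reference, RMSE is the square root of the mean squared error, RMSE = sqrt((1/n) * sum_i (y_i - y_hat_i)^2); it stays in the units of count, which makes it easier to interpret than MSE, while still penalizing large errors heavily.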
# compare different sets of features on the same train/test split
import numpy as np
from sklearn import metrics

feature_sets = {
    'RMSE1': ['temp', 'season', 'weather', 'humidity'],
    'RMSE2': ['temp', 'season', 'weather'],
    'RMSE3': ['temp', 'season', 'humidity'],
}
for label, cols in feature_sets.items():
    X = bikes[cols]
    y = bikes['count']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    linreg = LinearRegression().fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    print(f"{label}:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
RMSE1: 155.56954726427747
RMSE2: 164.59320591359494
RMSE3: 155.62192767690075
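Dropping humidity (RMSE2) hurts noticeably, while dropping weather (RMSE3) changes almost nothing. To put these numbers in context, a minimal baseline sketch: the "null" RMSE obtained by always predicting the training-set mean:
# null model: predict the training mean for every test sample
y_train, y_test = train_test_split(bikes['count'], random_state=0)
y_null = np.full(len(y_test), y_train.mean())
print("Null RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_null)))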
Appendix: Handling categorical features
There are two kinds of categorical features:
- Ordered categories: map them to corresponding numeric values (e.g. small=1, medium=2, large=3; see the sketch after this list)
- Unordered categories: use dummy encoding (0/1 indicator columns)
The categorical features in this dataset are:
- Ordered: weather (already encoded as the numeric values 1, 2, 3, 4)
- Unordered: season (needs dummy encoding), holiday (already dummy encoded), workingday (already dummy encoded)
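A minimal sketch of that ordinal mapping, using a hypothetical size column that is not part of this dataset:
# hypothetical ordered categorical column mapped to integer codes
sizes = pd.DataFrame({'size': ['small', 'large', 'medium']})
sizes['size_code'] = sizes['size'].map({'small': 1, 'medium': 2, 'large': 3})
print(sizes)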
# create dummy variables
season_dummies = pd.get_dummies(bikes.season, prefix='season')
# print 5 random rows
season_dummies.sample(n=5, random_state=1)
| datetime | season_1 | season_2 | season_3 | season_4 |
|---|---|---|---|---|
| 2011-09-05 11:00:00 | 0 | 0 | 1 | 0 |
| 2012-03-18 04:00:00 | 1 | 0 | 0 | 0 |
| 2012-10-14 17:00:00 | 0 | 0 | 0 | 1 |
| 2011-04-04 15:00:00 | 0 | 1 | 0 | 0 |
| 2012-12-11 02:00:00 | 0 | 0 | 0 | 1 |
We only need three dummy variables, not four: if all three are 0, the remaining category must be 1, so three columns carry exactly the same information as four. We can therefore drop the first dummy variable.
# drop the first column
season_dummies.drop(season_dummies.columns[0], axis=1, inplace=True)
# print 5 random rows
season_dummies.sample(n=5, random_state=1)
| datetime | season_2 | season_3 | season_4 |
|---|---|---|---|
| 2011-09-05 11:00:00 | 0 | 1 | 0 |
| 2012-03-18 04:00:00 | 0 | 0 | 0 |
| 2012-10-14 17:00:00 | 0 | 0 | 1 |
| 2011-04-04 15:00:00 | 1 | 0 | 0 |
| 2012-12-11 02:00:00 | 0 | 0 | 1 |
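pandas can also produce this directly; a behavior-equivalent one-liner (assuming a pandas version with the drop_first option, available since 0.18):
# equivalent: let get_dummies drop the first level itself
season_dummies = pd.get_dummies(bikes.season, prefix='season', drop_first=True)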
# concatenate the original DataFrame and the dummy DataFrame (axis=0 means rows, axis=1 means columns)
bikes = pd.concat([bikes, season_dummies], axis=1)
# print 5 random rows
bikes.sample(n=5, random_state=1)
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | season_2 | season_3 | season_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011-09-05 11:00:00 | 3 | 1 | 0 | 2 | 28.70 | 33.335 | 74 | 11.0014 | 101 | 207 | 308 | 0 | 1 | 0 |
| 2012-03-18 04:00:00 | 1 | 0 | 0 | 2 | 17.22 | 21.210 | 94 | 11.0014 | 6 | 8 | 14 | 0 | 0 | 0 |
| 2012-10-14 17:00:00 | 4 | 0 | 0 | 1 | 26.24 | 31.060 | 44 | 12.9980 | 193 | 346 | 539 | 0 | 0 | 1 |
| 2011-04-04 15:00:00 | 2 | 0 | 1 | 1 | 31.16 | 33.335 | 23 | 36.9974 | 47 | 96 | 143 | 1 | 0 | 0 |
| 2012-12-11 02:00:00 | 4 | 0 | 1 | 2 | 16.40 | 20.455 | 66 | 22.0028 | 0 | 1 | 1 | 0 | 0 | 1 |
Add the season dummy variables to the regression model's features, predict the bike rental count, and compare with the earlier models.
# include dummy variables for season in the model
feature_cols = ['temp', 'season_2', 'season_3', 'season_4', 'humidity']
X5 = bikes[feature_cols]
y5 = bikes['count']
X5_train, X5_test, y5_train, y5_test = train_test_split(X5, y5, random_state=0)
linreg5 = LinearRegression()
linreg5.fit(X5_train, y5_train)
y5_pred = linreg5.predict(X5_test)
print("RMSE5:", np.sqrt(metrics.mean_squared_error(y5_test,y5_pred)))
RMSE5: 154.27800312889627
Pick the best of the previous models (the dummy-encoded feature set, RMSE5 ≈ 154.28), add polynomial features (degree=2), then fit Ridge, Lasso, and ElasticNet models on the expanded features and compare them.
# add polynomial features (degree=2)
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=2)
X5_train_poly = pf.fit_transform(X5_train)
X5_test_poly = pf.transform(X5_test)  # transform (not fit) on the test set
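With 5 input features, the degree-2 expansion yields 21 columns: 1 bias term, 5 linear terms, and 15 squared/pairwise-product terms. A quick way to inspect them (get_feature_names_out assumes scikit-learn >= 1.0):
# inspect the expanded design matrix
print(X5_train_poly.shape)          # (n_train, 21)
print(pf.get_feature_names_out())   # names of the generated polynomial terms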
# Ridge regression model
from sklearn.linear_model import Ridge
rr = Ridge(alpha=0.001)
rr.fit(X5_train_poly, y5_train)
y5_pred_rr = rr.predict(X5_test_poly)
print("Root mean square error (RMSE):", np.sqrt(metrics.mean_squared_error(y5_test,y5_pred_rr)))
Root mean square error (RMSE): 152.29954589083513
# Lasso regression model
from sklearn.linear_model import Lasso
lassor = Lasso(alpha=0.0001)
lassor.fit(X5_train_poly, y5_train)
y5_pred_lr = lassor.predict(X5_test_poly)
print("Root mean square error (RMSE):", np.sqrt(metrics.mean_squared_error(y5_test,y5_pred_lr)))
Root mean square error (RMSE): 152.29987641774662
ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 9.440e+07, tolerance: 2.680e+04
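The ConvergenceWarning means the coordinate-descent solver stopped before fully converging. A hedged remedy sketch, assuming it is acceptable to standardize the polynomial features and raise max_iter (the resulting RMSE may shift slightly):
# possible fix: scale the features and give the solver more iterations
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
lasso_scaled = make_pipeline(StandardScaler(), Lasso(alpha=0.0001, max_iter=100000))
lasso_scaled.fit(X5_train_poly, y5_train)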
# ElasticNet regression model
from sklearn.linear_model import ElasticNet
elasticnetr = ElasticNet(alpha=0.001)
elasticnetr.fit(X5_train_poly, y5_train)
y5_pred_er = elasticnetr.predict(X5_test_poly)
print("Root mean square error (RMSE):", np.sqrt(metrics.mean_squared_error(y5_test,y5_pred_er)))
Root mean square error (RMSE): 152.30721433139385
ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 9.443e+07, tolerance: 2.680e+04
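The three regularized models land within about 0.01 of each other here, so which penalty is used matters less than how alpha is chosen. As a hedged follow-up sketch, cross-validation can select alpha instead of hard-coding it:
# select alpha by cross-validation on the training portion
from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=[0.001, 0.01, 0.1, 1.0, 10.0])
ridge_cv.fit(X5_train_poly, y5_train)
print("best alpha:", ridge_cv.alpha_)
print("RMSE:", np.sqrt(metrics.mean_squared_error(y5_test, ridge_cv.predict(X5_test_poly))))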