Linear Regression Exercise

In this exercise we use a bike-sharing dataset from a Kaggle competition: Bike Sharing Demand.
The dataset contains the hourly bike rental counts recorded by the Capital Bikeshare system during 2011 and 2012, together with the corresponding season, weather, and related information.

Data columns:

  • datetime - hourly date + timestamp
  • season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
  • holiday - whether the day is considered a holiday
  • workingday - whether the day is neither a weekend nor holiday
  • weather - 1: Clear, Few clouds, Partly cloudy, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
  • temp - temperature in Celsius
  • atemp - “feels like” temperature in Celsius
  • humidity - relative humidity
  • windspeed - wind speed
  • casual - number of non-registered user rentals initiated
  • registered - number of registered user rentals initiated
  • count - number of total rentals

Step 1: Read in the data

# read the data and set the datetime as the index
import pandas as pd

bikes = pd.read_csv('bikeshare.csv', index_col='datetime', parse_dates=True)
bikes.head()
                     season  holiday  workingday  weather  temp   atemp  humidity  windspeed  casual  registered  count
datetime
2011-01-01 00:00:00       1        0           0        1  9.84  14.395        81        0.0       3          13     16
2011-01-01 01:00:00       1        0           0        1  9.02  13.635        80        0.0       8          32     40
2011-01-01 02:00:00       1        0           0        1  9.02  13.635        80        0.0       5          27     32
2011-01-01 03:00:00       1        0           0        1  9.84  14.395        75        0.0       3          10     13
2011-01-01 04:00:00       1        0           0        1  9.84  14.395        75        0.0       0           1      1
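Before moving on, it is worth confirming that the file was parsed as expected. This quick check is not part of the original exercise; it only inspects the shape, the index type, and missing values.

# sanity check (not in the original notebook)
print(bikes.shape)           # number of rows and columns
print(bikes.index.dtype)     # should be datetime64[ns] because of parse_dates=True
print(bikes.isnull().sum())  # missing values per column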

Step 2: Visualize the data

  • Use matplotlib to draw a scatter plot of temperature ("temp") against the rental count ("count");
  • Use seaborn to draw a scatter plot of "temp" against "count" with a fitted linear relationship (hint: use seaborn's lmplot)
# matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(1, figsize = (10, 25))

tem = bikes["temp"]

cou = bikes["count"]

plt.ylabel("number of total rentals")

plt.xlabel("temperature in Celsius")

plt.scatter(tem, cou, s = 5)
[Figure: matplotlib scatter plot of temperature vs. rental count]

# seaborn
import seaborn as sns

sns.lmplot(x = "temp", y = "count", data = bikes, height = 20, aspect = 1)
[Figure: seaborn lmplot of temperature vs. rental count with a fitted regression line]

Step 3: Simple linear regression

Predict the bike rental count from temperature.

# create X and y
feature_cols0 = ["temp"]

X0 = bikes[feature_cols0]

y0 = bikes["count"]
# import, instantiate, fit
from sklearn.model_selection import train_test_split

X0_train, X0_test, y0_train, y0_test = train_test_split(X0, y0, random_state = 0)

from sklearn.linear_model import LinearRegression

linreg = LinearRegression()

linreg.fit(X0_train, y0_train)
LinearRegression()
# print the coefficients

list(zip(feature_cols0, linreg.coef_))
[('temp', 9.139291387753524)]
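The coefficient says that each additional degree Celsius is associated with roughly 9.1 more rentals per hour. As a small illustration (a sketch, not part of the original exercise), the fitted model can be asked for a prediction at a hypothetical temperature, say 25 °C:

# sketch: predict the rental count at a hypothetical 25 °C, using the linreg fitted above
new_obs = pd.DataFrame({"temp": [25]})
print(linreg.intercept_, linreg.coef_)  # intercept and slope of the fitted line
print(linreg.predict(new_obs))          # predicted count at temp = 25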

Step 4: Explore multiple features

# explore more features
feature_cols = ['temp', 'season', 'weather', 'humidity']
# using seaborn, draw multiple scatter plots between each feature in feature_cols and 'count'
sns.pairplot(bikes, x_vars = feature_cols, y_vars = 'count', height = 7, aspect = 0.7)
[Figure: pairplots of temp, season, weather, and humidity against count]

# correlation matrix (values range from -1 to 1)
bikes.corr()
              season   holiday  workingday   weather      temp     atemp  humidity  windspeed    casual  registered     count
season      1.000000  0.029368   -0.008126  0.008879  0.258689  0.264744  0.190610  -0.147121  0.096758    0.164011  0.163439
holiday     0.029368  1.000000   -0.250491 -0.007074  0.000295 -0.005215  0.001929   0.008409  0.043799   -0.020956 -0.005393
workingday -0.008126 -0.250491    1.000000  0.033772  0.029966  0.024660 -0.010880   0.013373 -0.319111    0.119460  0.011594
weather     0.008879 -0.007074    0.033772  1.000000 -0.055035 -0.055376  0.406244   0.007261 -0.135918   -0.109340 -0.128655
temp        0.258689  0.000295    0.029966 -0.055035  1.000000  0.984948 -0.064949  -0.017852  0.467097    0.318571  0.394454
atemp       0.264744 -0.005215    0.024660 -0.055376  0.984948  1.000000 -0.043536  -0.057473  0.462067    0.314635  0.389784
humidity    0.190610  0.001929   -0.010880  0.406244 -0.064949 -0.043536  1.000000  -0.318607 -0.348187   -0.265458 -0.317371
windspeed  -0.147121  0.008409    0.013373  0.007261 -0.017852 -0.057473 -0.318607   1.000000  0.092276    0.091052  0.101369
casual      0.096758  0.043799   -0.319111 -0.135918  0.467097  0.462067 -0.348187   0.092276  1.000000    0.497250  0.690414
registered  0.164011 -0.020956    0.119460 -0.109340  0.318571  0.314635 -0.265458   0.091052  0.497250    1.000000  0.970948
count       0.163439 -0.005393    0.011594 -0.128655  0.394454  0.389784 -0.317371   0.101369  0.690414    0.970948  1.000000
sns.heatmap(bikes.corr())
[Figure: heatmap of the correlation matrix]

Use the four features 'temp', 'season', 'weather', and 'humidity' to predict the rental count 'count'.

# create X and y
X4 = bikes[feature_cols]
y4 = bikes['count']
# import, instantiate, fit
from sklearn.model_selection import train_test_split

X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y4, random_state=0)

from sklearn.linear_model import LinearRegression

linreg = LinearRegression()

linreg.fit(X4_train, y4_train)
LinearRegression()
# print the coefficients
list(zip(feature_cols, linreg.coef_))
[('temp', 7.801241497301735),
 ('season', 23.781793850855003),
 ('weather', 6.597649302354395),
 ('humidity', -3.120711730873758)]

Compare several different models using a train/test split and RMSE.

# compare different sets of features
feature_cols1 = ['temp', 'season', 'weather', 'humidity']
feature_cols2 = ['temp', 'season', 'weather']
feature_cols3 = ['temp', 'season', 'humidity']

import numpy as np

from sklearn import metrics

X1 = bikes[feature_cols1]
y1 = bikes['count']
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, random_state=0)
linreg1 = LinearRegression()
linreg1.fit(X1_train, y1_train)
y1_pred = linreg1.predict(X1_test)

print("RMSE1:", np.sqrt(metrics.mean_squared_error(y1_test,y1_pred)))

X2 = bikes[feature_cols2]
y2 = bikes['count']
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, random_state=0)
linreg2 = LinearRegression()
linreg2.fit(X2_train, y2_train)
y2_pred = linreg2.predict(X2_test)
print("RMSE2:", np.sqrt(metrics.mean_squared_error(y2_test,y2_pred)))

X3 = bikes[feature_cols3]
y3 = bikes['count']
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, random_state=0)
linreg3 = LinearRegression()
linreg3.fit(X3_train, y3_train)
y3_pred = linreg3.predict(X3_test)
print("RMSE3:", np.sqrt(metrics.mean_squared_error(y3_test,y3_pred)))
RMSE1: 155.56954726427747
RMSE2: 164.59320591359494
RMSE3: 155.62192767690075
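The three blocks above repeat the same split/fit/predict/score pattern. A more compact version (a sketch, not part of the original notebook, reusing the imports and the bikes DataFrame defined above) wraps the pattern in a helper function and loops over the candidate feature lists:

# sketch: factor the repeated pattern into a helper function
def train_test_rmse(feature_cols):
    # fit a linear regression on a train split and return the test RMSE
    X = bikes[feature_cols]
    y = bikes['count']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

for cols in (feature_cols1, feature_cols2, feature_cols3):
    print(cols, train_test_rmse(cols))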

Supplement: Handling categorical features

There are two kinds of categorical features:

  • Ordered (ordinal) categories: convert them to the corresponding numeric values (e.g. small=1, medium=2, large=3); see the sketch right after this list
  • Unordered (nominal) categories: use dummy encoding (0/1 indicator columns)
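For the ordinal case, a pandas map does the conversion. The column and categories below are hypothetical (this dataset's only ordinal feature, weather, is already numeric); the sketch only shows the pattern:

# sketch with a hypothetical ordinal column 'size' (not a column of this dataset)
sizes = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})
sizes['size_num'] = sizes['size'].map({'small': 1, 'medium': 2, 'large': 3})
print(sizes)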

The categorical features in this dataset are:

  • Ordinal: weather (already encoded as the numeric values 1, 2, 3, 4)
  • Nominal: season (needs dummy encoding), holiday (already dummy encoded), workingday (already dummy encoded)
# create dummy variables
season_dummies = pd.get_dummies(bikes.season, prefix='season')

# print 5 random rows
season_dummies.sample(n=5, random_state=1)
                     season_1  season_2  season_3  season_4
datetime
2011-09-05 11:00:00         0         0         1         0
2012-03-18 04:00:00         1         0         0         0
2012-10-14 17:00:00         0         0         0         1
2011-04-04 15:00:00         0         1         0         0
2012-12-11 02:00:00         0         0         0         1

We only need three dummy variables, not four: if all three are 0, the remaining one must be 1, so keeping three columns carries the same information as keeping four. We can therefore drop the first dummy variable.
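As an aside, pd.get_dummies can drop the first level itself through its drop_first parameter; the one-liner below (not used in the original notebook) produces the same three columns as the explicit drop that follows:

# equivalent one-step alternative to the explicit column drop below
season_dummies_alt = pd.get_dummies(bikes.season, prefix='season', drop_first=True)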

# drop the first column
season_dummies.drop(season_dummies.columns[0], axis=1, inplace=True)

# print 5 random rows
season_dummies.sample(n=5, random_state=1)
                     season_2  season_3  season_4
datetime
2011-09-05 11:00:00         0         1         0
2012-03-18 04:00:00         0         0         0
2012-10-14 17:00:00         0         0         1
2011-04-04 15:00:00         1         0         0
2012-12-11 02:00:00         0         0         1
# concatenate the original DataFrame and the dummy DataFrame (axis=0 means rows, axis=1 means columns)
bikes = pd.concat([bikes, season_dummies], axis=1)

# print 5 random rows
bikes.sample(n=5, random_state=1)
                     season  holiday  workingday  weather   temp   atemp  humidity  windspeed  casual  registered  count  season_2  season_3  season_4
datetime
2011-09-05 11:00:00       3        1           0        2  28.70  33.335        74    11.0014     101         207    308         0         1         0
2012-03-18 04:00:00       1        0           0        2  17.22  21.210        94    11.0014       6           8     14         0         0         0
2012-10-14 17:00:00       4        0           0        1  26.24  31.060        44    12.9980     193         346    539         0         0         1
2011-04-04 15:00:00       2        0           1        1  31.16  33.335        23    36.9974      47          96    143         1         0         0
2012-12-11 02:00:00       4        0           1        2  16.40  20.455        66    22.0028       0           1      1         0         0         1

Add the dummy variables as features of the regression model, predict the rental count, and compare with the previous models.

# include dummy variables for season in the model
feature_cols = ['temp', 'season_2', 'season_3', 'season_4', 'humidity']

X5 = bikes[feature_cols]
y5 = bikes['count']
X5_train, X5_test, y5_train, y5_test = train_test_split(X5, y5, random_state=0)
linreg5 = LinearRegression()
linreg5.fit(X5_train, y5_train)
y5_pred = linreg5.predict(X5_test)

print("RMSE5:", np.sqrt(metrics.mean_squared_error(y5_test,y5_pred)))
RMSE5: 154.27800312889627

Choose the best of the previous models, add polynomial features (degree=2), then fit Ridge, Lasso, and ElasticNet models, make predictions, and compare them.

# add polynomial features (degree=2)
from sklearn.preprocessing import PolynomialFeatures

pf = PolynomialFeatures(degree=2)
X5_train_poly = pf.fit_transform(X5_train)
X5_test_poly = pf.transform(X5_test)  # transform (not fit) the test set with the already-fitted object
# Ridge regression model
from sklearn.linear_model import Ridge

rr = Ridge(alpha=0.001)
rr.fit(X5_train_poly, y5_train)
y5_pred_rr = rr.predict(X5_test_poly)
print("Root mean square error (RMSE):", np.sqrt(metrics.mean_squared_error(y5_test,y5_pred_rr)))
Root mean square error (RMSE): 152.29954589083513
# Lasso regression model
from sklearn.linear_model import Lasso

lassor = Lasso(alpha=0.0001)
lassor.fit(X5_train_poly, y5_train)
y5_pred_lr = lassor.predict(X5_test_poly)
print("Root mean square error (RMSE):", np.sqrt(metrics.mean_squared_error(y5_test,y5_pred_lr)))
Root mean square error (RMSE): 152.29987641774662


C:\Users\86157\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:647: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 9.440e+07, tolerance: 2.680e+04
  model = cd_fast.enet_coordinate_descent(
# ElasticNet regression model
from sklearn.linear_model import ElasticNet

elasticnetr = ElasticNet(alpha=0.001)
elasticnetr.fit(X5_train_poly, y5_train)
y5_pred_er = elasticnetr.predict(X5_test_poly)
print("Root mean square error (RMSE):", np.sqrt(metrics.mean_squared_error(y5_test,y5_pred_er)))
Root mean square error (RMSE): 152.30721433139385


C:\Users\86157\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:647: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 9.443e+07, tolerance: 2.680e+04
  model = cd_fast.enet_coordinate_descent(
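Both warnings suggest scaling the features or increasing the number of iterations. One way to do that (a sketch, not part of the original exercise) is to standardize the polynomial features inside a scikit-learn Pipeline and give the solver more iterations, which usually lets Lasso converge cleanly:

# sketch: scale the polynomial features before the Lasso to address the ConvergenceWarning
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso

lasso_pipe = make_pipeline(PolynomialFeatures(degree=2),
                           StandardScaler(),
                           Lasso(alpha=0.0001, max_iter=10000))
lasso_pipe.fit(X5_train, y5_train)
y5_pred_pipe = lasso_pipe.predict(X5_test)
print("RMSE (scaled Lasso):", np.sqrt(metrics.mean_squared_error(y5_test, y5_pred_pipe)))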