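While preparing for the MathorCup competition, I noticed that the hourly uplink and downlink traffic in the problem data rose and fell in periodic peaks and troughs, and that most of the series also changed smoothly over time, so I decided to model and forecast them with ARIMA.
However, the ARIMA orders (p, d, q) differ from one dataset to the next. That means every single series needs the full manual routine to pin down ARIMA(p, d, q): autocorrelation plot -> stationarity test -> white-noise test -> order selection, blah blah.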
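To make the manual routine a bit more concrete, here is a minimal pure-Python sketch of three of its ingredients: differencing, the sample autocorrelation function, and the Ljung-Box Q statistic used in white-noise tests. The toy series and all function names are my own assumptions for illustration; in practice a library such as statsmodels would normally do this work.

```python
import math

def difference(series, lag=1):
    """Return the lag-differenced series: x[t] - x[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]

def ljung_box_q(series, max_lag):
    """Ljung-Box Q statistic; large values argue against white noise."""
    n = len(series)
    rho = acf(series, max_lag)
    return n * (n + 2) * sum(rho[k] ** 2 / (n - k) for k in range(1, max_lag + 1))

# Toy hourly series (made up): a 24-hour cycle on top of a slow upward trend.
hours = range(24 * 14)
traffic = [0.05 * t + 10 * math.sin(2 * math.pi * t / 24) for t in hours]

rho = acf(traffic, 24)
diffed = difference(traffic, lag=1)   # first difference removes the linear trend
q_stat = ljung_box_q(diffed, max_lag=24)

print(f"ACF at lag 12: {rho[12]:.2f}, at lag 24: {rho[24]:.2f}")
print(f"Ljung-Box Q on the differenced series: {q_stat:.1f}")
```

For a raw series, Q is compared against a chi-square quantile with max_lag degrees of freedom; a value far above it means the series still has structure left to model.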
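That routine is painful, so my first idea was to loop over the p and q parameters and keep the model with the smallest AIC (someone online had already written this kind of code; I've put it at the end of the article, and that version scores by BIC). But it was too slow... Possibly my own code just had too high an algorithmic complexity (too many loops; I really regret not entering contests like the Lanqiao Cup to practice trimming time and space complexity).
Then I figured Python can do just about anything, so I turned to the all-knowing third-party libraries and found a black-box modeling package: pmdarima. Its built-in auto_arima function searches over the (p, d, q) orders automatically and returns the candidate with the lowest AIC, and that settles it. Note that by default the search is stepwise rather than a full enumeration of every combination, and that d, when not given, is chosen by a stationarity test rather than by AIC.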
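For reference, the two information criteria mentioned in this article are usually defined as below. Here \hat{L} is the maximized likelihood of the fitted model, k is the number of estimated parameters, and n is the number of observations; exactly which parameters count toward k (the constant, the noise variance) depends on the library.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

Lower is better for both. BIC penalizes each extra parameter more heavily than AIC as soon as \ln n > 2 (that is, n \ge 8), so it tends to pick smaller p and q.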

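I. Introduction to the pmdarima package

I won't bother translating the overview below. Installation, imports, and the content of the original website are all covered in it, so have a read yourself.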

Pmdarima (originally pyramid-arima, for the anagram of ‘py’ + ‘arima’) is a statistical library designed to fill the void in Python’s time series analysis capabilities. This includes:

  • The equivalent of R’s auto.arima functionality
  • A collection of statistical tests of stationarity and seasonality
  • Time series utilities, such as differencing and inverse differencing
  • Numerous endogenous and exogenous transformers and featurizers, including Box-Cox and Fourier transformations
  • Seasonal time series decompositions
  • Cross-validation utilities
  • A rich collection of built-in time series datasets for prototyping and examples
  • Scikit-learn-esque pipelines to consolidate your estimators and promote productionization

Pmdarima wraps statsmodels under the hood, but is designed with an interface that’s familiar to users coming from a scikit-learn background.

Installation

Pmdarima has binary and source distributions for Windows, Mac and Linux (manylinux) on pypi under the package name pmdarima and can be downloaded via pip:

$ pip install pmdarima

Quickstart Examples

Fitting a simple auto-ARIMA on the wineind dataset:

import pmdarima as pm
from pmdarima.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt

# Load/split your data
y = pm.datasets.load_wineind()
train, test = train_test_split(y, train_size=150)

# Fit your model
model = pm.auto_arima(train, seasonal=True, m=12)

# make your forecasts
forecasts = model.predict(test.shape[0])  # predict N steps into the future

# Visualize the forecasts (blue=train, green=forecasts)
x = np.arange(y.shape[0])
plt.plot(x[:150], train, c='blue')
plt.plot(x[150:], forecasts, c='green')
plt.show()

Fitting a more complex pipeline on the sunspots dataset, serializing it, and then loading it from disk to make predictions:

import pmdarima as pm
from pmdarima.model_selection import train_test_split
from pmdarima.pipeline import Pipeline
from pmdarima.preprocessing import BoxCoxEndogTransformer
import pickle

# Load/split your data
y = pm.datasets.load_sunspots()
train, test = train_test_split(y, train_size=2700)

# Define and fit your pipeline
pipeline = Pipeline([
    ('boxcox', BoxCoxEndogTransformer(lmbda2=1e-6)),  # lmbda2 avoids negative values
    ('arima', pm.AutoARIMA(seasonal=True, m=12,suppress_warnings=True,trace=True))])

pipeline.fit(train)

# Serialize your model just like you would in scikit:
with open('model.pkl', 'wb') as pkl:
    pickle.dump(pipeline, pkl)
    
# Load it and make predictions seamlessly:
with open('model.pkl', 'rb') as pkl:
    mod = pickle.load(pkl)
    print(mod.predict(15))
# [25.20580375 25.05573898 24.4263037  23.56766793 22.67463049 21.82231043
# 21.04061069 20.33693017 19.70906027 19.1509862  18.6555793  18.21577243
# 17.8250318  17.47750614 17.16803394]

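II. Python implementation

The examples above already show the basic usage, but here is a rough breakdown of how to set the model's parameters and how to call the function.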

from pmdarima.arima import auto_arima

# auto_arima returns an already-fitted model, so the extra fit() call is dropped
model1 = auto_arima(data_low, start_p=1, start_q=1, max_p=3, max_q=3,
                    m=12, start_P=0, seasonal=True, d=1, D=1, trace=True,
                    error_action='ignore', suppress_warnings=True, stepwise=True)
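  • data_low: my training set
  • start_p: the value of p at which the search starts
  • max_p: the largest value of p the search may try
  • seasonal: whether to fit a seasonal ARIMA; m is the number of observations per seasonal cycle
  • trace: whether to print each candidate model as it is fitted
  • stepwise: whether to use the stepwise search instead of fitting every combination on the grid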
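III. Exporting the model

Here is how to save the model as well. I use the dump function from the joblib package (note: joblib used to ship inside sklearn as sklearn.externals.joblib; after I updated Anaconda, which apparently updated sklearn along the way, it could only be imported on its own).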

import joblib
joblib.dump(model2,'model_save/'+str(i)+'.pkl')
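The call above assumes that the model_save/ directory already exists and that i is the index of the current dataset. Below is a small sketch of saving one file per model and loading it back. To keep it dependency-free I swap in the standard-library pickle module for joblib (joblib.load is the matching loader in the joblib version); the helper names and the placeholder "models" are made up.

```python
import pickle
import tempfile
from pathlib import Path

def save_models(models, out_dir):
    """Write one pickle per model, named by its index, creating out_dir if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, model in enumerate(models):
        path = out_dir / f"{i}.pkl"
        with path.open("wb") as fh:
            pickle.dump(model, fh)
        paths.append(path)
    return paths

def load_model(path):
    """Read a single pickled model back from disk."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)

# Placeholder objects standing in for fitted models (hypothetical).
fake_models = [{"order": (1, 1, 1)}, {"order": (0, 1, 2)}]

with tempfile.TemporaryDirectory() as tmp:
    saved = save_models(fake_models, Path(tmp) / "model_save")
    restored = [load_model(p) for p in saved]

print(restored)
```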

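IV. Pros and cons

The pros are roughly:

  1. Compared with writing my own unoptimized loops, this approach is much faster
  2. It is fully automatic: hands off, and the computer grinds through it on its own

There are quite a few cons too:

  1. Training one model takes about 1 minute on average
  2. It cannot detect white-noise points and fall back to a lower-order model accordingly (or maybe I just haven't figured out how?)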

V. pmdarima download link && grid-search code over (p, q)

pmdarima package: https://download.csdn.net/download/weixin_45839604/14075267

Structure of the grid-search code:

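import pandas as pd
# statsmodels.tsa.arima_model no longer works in recent statsmodels releases (0.13+)
from statsmodels.tsa.arima.model import ARIMA

pmax = int(len(data['low_GB'])/10)    # rule of thumb: the order should not exceed length / 10
qmax = int(len(data['on_GB'])/10)

bic_matrix = []

for p in range(pmax+1):
    temp = []
    for q in range(qmax+1):
        try:
            temp.append(ARIMA(data_on, order=(p, 1, q)).fit().bic)
        except Exception:
            temp.append(None)   # the fit failed for this (p, q)
    bic_matrix.append(temp)     # append each row once, after the inner loop finishes

bic_matrix = pd.DataFrame(bic_matrix)   # convert to a DataFrame
p,q = bic_matrix.stack().astype('float64').idxmin()   # stack() flattens, idxmin() locates the minimum
print('p and q with the smallest BIC: %s,%s' %(p,q))  # result on the author's data: 0,1
# so the model to build is ARIMA(0,1,1)
model = ARIMA(data_on, order=(p,1,q)).fit()
# model.summary()         # print a model report
# model.forecast(5)       # point forecasts for the next 5 periods
# model.get_forecast(5)   # also exposes standard errors and confidence intervals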
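Two details of the grid search are worth spelling out: how the best cell is picked when some fits failed (the None entries), and how quickly the length / 10 rule inflates the number of fits. The pure-Python sketch below mirrors what stack().idxmin() does; the BIC values and series lengths are invented for illustration.

```python
def best_order(bic_matrix):
    """Return (p, q) of the smallest non-None entry, where bic_matrix[p][q] is a BIC."""
    scored = [
        (bic, p, q)
        for p, row in enumerate(bic_matrix)
        for q, bic in enumerate(row)
        if bic is not None          # skip orders whose fit failed
    ]
    if not scored:
        raise ValueError("every fit failed")
    _, p, q = min(scored)
    return p, q

def grid_size(n_obs):
    """Number of ARIMA fits when pmax = qmax = n_obs // 10."""
    pmax = qmax = n_obs // 10
    return (pmax + 1) * (qmax + 1)

# Made-up BIC values; None marks a failed fit.
toy_bic = [
    [None, 410.2, 415.0],
    [412.7, 409.8, None],
    [420.1, None, 411.3],
]

print("best (p, q):", best_order(toy_bic))
print("fits for 240 hourly points:", grid_size(240))
print("fits for 720 hourly points:", grid_size(720))
```

The number of fits grows roughly with the square of the series length, which goes a long way toward explaining why the hand-written loop felt so slow.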
