Overview

RMSProp stands for Root Mean Square Propagation, an optimization technique built on the gradient descent algorithm. It belongs to the family of adaptive optimization algorithms, and these adaptive optimizers lead to faster and more efficient training. We will study RMSProp by first looking at RProp, understanding its shortcomings, and then seeing how RMSProp addresses them.

Introduction

Geoffrey Hinton, the father of the backpropagation algorithm, first proposed RMSProp. RMSProp works by changing the step size at every iteration to produce better results. It dampens oscillations and reaches the minimum faster. It uses a decaying moving average of the gradients, so that earlier gradients are gradually forgotten while more recent gradients are given priority in the calculation.

From RProp to RMSProp

RProp, or resilient propagation, was introduced to deal with the problem of gradients that vary widely in magnitude. It introduces an adaptive step size and solves the problem by looking only at the signs of the last two gradients. RProp works by comparing the sign of the previous gradient with that of the current gradient for each weight and adjusting the step size accordingly. The pseudocode below gives a better picture.

for t in range(1, n_iter):
	if dw[t-1] * dw[t] > 0:
		step_size = min(step_size * increment_factor, max_step_size)
	elif dw[t-1] * dw[t] < 0:
		step_size = max(step_size * decrement_factor, min_step_size)
	w[t] = w[t-1] - sign(dw[t]) * step_size

w[t] -> the weights at step t

dw[t] -> the gradient with respect to the weights at step t

If the previous and current gradients have the same sign, the step size is increased by multiplying it by the increment factor, typically a number between 1 and 2. If the signs differ, the step size is decreased by multiplying it by the decrement factor, typically 0.5.

The problem with RProp is that it does not work well with mini-batches, because it violates the central idea of mini-batch gradient descent: when the learning rate is small enough, the updates from successive mini-batch gradients effectively average out. RProp does not do this. For example, if there are nine gradients of +0.1 and the tenth gradient is -0.9, ideally we would like the gradients to be averaged and cancel each other out. In RProp, however, the step size is incremented nine times and decremented only once, so the weight receives a much larger update than it should.
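
To make this concrete, here is a small illustrative sketch (not from the original article) that contrasts plain averaging of mini-batch gradients with RProp's sign-based rule; the gradient values and the increment/decrement factors (1.2 and 0.5) are assumed for illustration.

import numpy as np

# hypothetical mini-batch gradients for one weight:
# nine batches give +0.1 and the tenth gives -0.9
grads = np.array([0.1] * 9 + [-0.9])

# mini-batch gradient descent effectively averages them out
print(np.mean(grads))          # 0.0 -> the updates cancel, as desired

# RProp only looks at the sign, so the step size grows nine times
# and shrinks only once, ending up larger than it started
step_size, prev = 0.01, 0.1    # assume the previous gradient was also positive
for g in grads:
	if prev * g > 0:
		step_size = min(step_size * 1.2, 1.0)
	elif prev * g < 0:
		step_size = max(step_size * 0.5, 1e-6)
	prev = g
print(step_size)               # about 0.026, larger than the initial 0.01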

Ideally, then, we need a technique with a moving-average filter that overcomes the problems of RProp while still keeping its robustness and efficiency. This brings us to RMSProp.

Root Mean Square Propagation (RMSProp)

Root mean square propagation reduces oscillations by dividing the gradient step by the square root of a moving average of the squared gradients.

The update equations used by RMSProp are:

$$V_{dW} = \beta \, V_{dW} + (1 - \beta)\,(dW)^2$$

$$W = W - \alpha \, \frac{dW}{\sqrt{V_{dW}} + \epsilon}$$

In the first equation, Vdw is the moving average used to scale the step size: it carries information from the previous gradients. The parameter β controls how many previous terms are effectively taken into account in the first term; β = 0.9 works for most practical applications, and a value of 0.9 corresponds roughly to averaging the last 10 terms. The second term adds the square of the current gradient, which is what reduces the oscillations. Intuitively, since Vdw appears in the denominator of the learning-rate term, a low Vdw increases the effective learning rate and a high Vdw decreases it. So when the gradients are large, this technique smooths them out, thereby reducing oscillations.
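
As a minimal sketch of these two equations in code (not part of the article's main example), the loop below applies RMSProp to a single weight of the function f(w) = w**2, assuming β = 0.9, a step size of 0.01, and ε = 1e-8.

from math import sqrt

beta, step_size, eps = 0.9, 0.01, 1e-8   # assumed decay rate, learning rate, epsilon
w, v_dw = 1.0, 0.0                       # weight and its squared-gradient average

for t in range(5):
	dw = 2.0 * w                                     # gradient of f(w) = w**2
	v_dw = beta * v_dw + (1.0 - beta) * dw**2        # first equation: moving average
	w = w - step_size * dw / (sqrt(v_dw) + eps)      # second equation: scaled update
	print(t, w, v_dw)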

Gradient Descent with RMSProp

Gradient descent is an optimization algorithm used to train machine learning models. It iteratively follows the gradient to find the minimum of the cost function. RMSProp is a modified form of gradient descent that uses a decaying moving average of past gradients rather than only the current value.
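
For comparison, plain gradient descent simply scales the raw current gradient by a fixed learning rate; the toy loop below (with an assumed learning rate of 0.1) shows that baseline before we move on to RMSProp.

# vanilla gradient descent on f(w) = w**2, no gradient history involved
w, learning_rate = 1.0, 0.1
for t in range(5):
	w = w - learning_rate * (2.0 * w)    # step along the current gradient only
	print(t, w)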

A Two-Dimensional Test Problem

We will start by choosing an objective function to test against.

Objective function: $f(x, y) = x^2 + y^2$

For simplicity, we will keep the bounds of the example between -1 and +1.

The inputs are sampled at intervals of 0.1, and the resulting values are rendered as a 3D surface plot and as a filled contour plot.

from numpy import arange, meshgrid
import matplotlib.pyplot as plt

# a simple second degree function
def function(x, y):
	return x**2.0 + y**2.0

# inputs sampled at each 0.1 interval
xaxis = arange(-1, 1, 0.1)
yaxis = arange(-1, 1, 0.1)
# meshgrid from the created x and y axes
x, y = meshgrid(xaxis, yaxis)
# evaluate the function on the grid
results = function(x, y)

# create a 3d surface plot
figure = plt.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
plt.show()
# plot a filled contour with 100 levels
plt.contourf(x, y, results, levels=100, cmap='jet')
plt.show()

Gradient Descent Optimization with RMSProp

First, we implement the same objective function as before, along with a function that returns its derivative (the gradient).

from math import sqrt
from numpy import asarray,arange,meshgrid
from numpy.random import rand, seed
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# target function is a simple second-degree equation
def objective_function(x, y):
	return x**2.0 + y**2.0

# function to find the derivative of target function x**2 +y**2
def derivative_function(x, y):
	return asarray([x * 2.0, y * 2.0])
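
As a quick sanity check (not part of the original listing), the analytical gradient can be compared against a central finite difference; the helper below and its step size h are purely illustrative.

# verify derivative_function against a central finite difference
def numerical_gradient(x, y, h=1e-6):
	dx = (objective_function(x + h, y) - objective_function(x - h, y)) / (2.0 * h)
	dy = (objective_function(x, y + h) - objective_function(x, y - h)) / (2.0 * h)
	return asarray([dx, dy])

print(derivative_function(0.3, -0.2))    # [ 0.6 -0.4]
print(numerical_gradient(0.3, -0.2))     # approximately the same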

Next, we define the RMSProp routine. Before the main loop, we generate the starting point and initialize the averages of the squared gradients.

# RMSProp algorithm
def RMSProp(objective_function, derivative_function, values_range, n_iterations, step_size, Beta):
	# list of all solution points
	all_solutions_list = list()
	# initial point generation within the range
	current_solution_point = values_range[:, 0] + rand(len(values_range)) * (values_range[:, 1] - values_range[:, 0])
	# squared gradients average
	squared_gradient_avg = [0.0 for _ in range(values_range.shape[0])]

We create a for loop over the number of iterations. Inside it, we compute the gradient and create another for loop to update the moving average of the squared gradient for each variable.

	#Gradient Descent Algorithm looped n times
	for n in range(n_iterations):
		
		gradient = derivative_function(current_solution_point[0], current_solution_point[1])
		# loop to calculate an average of the squared gradients
		for i in range(gradient.shape[0]):
			squared_gradient = gradient[i]**2.0
			squared_gradient_avg[i] = (squared_gradient_avg[i] * Beta) + (squared_gradient * (1.0-Beta))

Another loop updates the effective learning rate (alpha) for each variable and updates the corresponding weight.

		# update solution point using the squared gradient average
		updated_solution = list()
		for j in range(current_solution_point.shape[0]):
			# per-variable learning rate calculation
			alpha = step_size / (1e-8 + sqrt(squared_gradient_avg[j]))
			# update solution point
			updated_value = current_solution_point[j] - alpha * gradient[j]
			updated_solution.append(updated_value)

We append each new solution to the list and, once the iterations are finished, print the result and return the list of solutions.

		# store updated solution
		current_solution_point = asarray(updated_solution)
		all_solutions_list.append(current_solution_point)
		# value of the function at updated solution point
		solution_eval_value = objective_function(current_solution_point[0], current_solution_point[1])

		print('iteration: %d  |  (x,y) = (%s)  |  eval_value = %.5f' % (n, current_solution_point, solution_eval_value))
	return all_solutions_list
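
For reference, the two inner loops can also be collapsed into array operations. The self-contained sketch below is an equivalent, vectorized single-step version for the x**2 + y**2 objective; it is illustrative and not part of the listing above.

import numpy as np

def rmsprop_step(point, squared_gradient_avg, step_size=0.01, beta=0.99, eps=1e-8):
	# one vectorized RMSProp update for the x**2 + y**2 objective
	gradient = np.asarray([2.0 * point[0], 2.0 * point[1]])
	squared_gradient_avg = beta * squared_gradient_avg + (1.0 - beta) * gradient**2
	point = point - step_size * gradient / (eps + np.sqrt(squared_gradient_avg))
	return point, squared_gradient_avg

point, avg = np.asarray([0.5, -0.5]), np.zeros(2)
for _ in range(10):
	point, avg = rmsprop_step(point, avg)
print(point)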

We initialize the random seed, the bounds, beta, and the step size, and then pass these values to the RMSProp routine.

# seed random number generator with any value
seed(1)
# min and max bounds of the input
values_range = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# number of iterations
n_iterations = 50
# step size 
step_size = 0.01
# decay factor (beta) for rmsprop; 0.99 works for most use cases
Beta = 0.99
# RMSProp gradient Descent
solutions = RMSProp(objective_function, derivative_function, values_range, n_iterations, step_size, Beta)

We get the following results.

Output:

iteration: 0  |  (x,y) = ([0.54064896 0.340649  ])  |  eval_value = 0.40834
iteration: 1  |  (x,y) = ([0.24501018 0.27929536])  |  eval_value = 0.13804
iteration: 2  |  (x,y) = ([0.23935992 0.23417693])  |  eval_value = 0.11213
iteration: 3  |  (x,y) = ([0.19767195 0.19863849])  |  eval_value = 0.07853
iteration: 4  |  (x,y) = ([0.16964129 0.16964301])  |  eval_value = 0.05756
iteration: 5  |  (x,y) = ([0.14537973 0.145492  ])  |  eval_value = 0.04230
iteration: 6  |  (x,y) = ([0.12503593 0.12511178])  |  eval_value = 0.03129
iteration: 7  |  (x,y) = ([0.10769972 0.10776512])  |  eval_value = 0.02321
iteration: 8  |  (x,y) = ([0.09286006 0.09291479])  |  eval_value = 0.01726
iteration: 9  |  (x,y) = ([0.08010512 0.08015162])  |  eval_value = 0.01284
iteration: 10  |  (x,y) = ([0.06911364 0.06915332])  |  eval_value = 0.00956
iteration: 11  |  (x,y) = ([0.05962546 0.05965946])  |  eval_value = 0.00711
iteration: 12  |  (x,y) = ([0.05142626 0.05145547])  |  eval_value = 0.00529
iteration: 13  |  (x,y) = ([0.04433676 0.04436191])  |  eval_value = 0.00393
iteration: 14  |  (x,y) = ([0.03820535 0.03822702])  |  eval_value = 0.00292
iteration: 15  |  (x,y) = ([0.0329027  0.03292139])  |  eval_value = 0.00217
iteration: 16  |  (x,y) = ([0.02831784 0.02833396])  |  eval_value = 0.00160
iteration: 17  |  (x,y) = ([0.02435509 0.02436899])  |  eval_value = 0.00119
iteration: 18  |  (x,y) = ([0.02093171 0.0209437 ])  |  eval_value = 0.00088
iteration: 19  |  (x,y) = ([0.01797599 0.01798633])  |  eval_value = 0.00065
iteration: 20  |  (x,y) = ([0.01542569 0.0154346 ])  |  eval_value = 0.00048
iteration: 21  |  (x,y) = ([0.01322671 0.01323438])  |  eval_value = 0.00035
iteration: 22  |  (x,y) = ([0.01133204 0.01133864])  |  eval_value = 0.00026
iteration: 23  |  (x,y) = ([0.00970081 0.00970649])  |  eval_value = 0.00019
iteration: 24  |  (x,y) = ([0.0082975  0.00830238])  |  eval_value = 0.00014
iteration: 25  |  (x,y) = ([0.00709123 0.00709542])  |  eval_value = 0.00010
iteration: 26  |  (x,y) = ([0.00605519 0.00605879])  |  eval_value = 0.00007
iteration: 27  |  (x,y) = ([0.00516609 0.00516918])  |  eval_value = 0.00005
iteration: 28  |  (x,y) = ([0.00440374 0.00440639])  |  eval_value = 0.00004
iteration: 29  |  (x,y) = ([0.00375063 0.0037529 ])  |  eval_value = 0.00003
iteration: 30  |  (x,y) = ([0.00319159 0.00319353])  |  eval_value = 0.00002
iteration: 31  |  (x,y) = ([0.00271348 0.00271514])  |  eval_value = 0.00001
iteration: 32  |  (x,y) = ([0.00230496 0.00230637])  |  eval_value = 0.00001
iteration: 33  |  (x,y) = ([0.00195619 0.00195739])  |  eval_value = 0.00001
iteration: 34  |  (x,y) = ([0.0016587  0.00165973])  |  eval_value = 0.00001
iteration: 35  |  (x,y) = ([0.00140518 0.00140606])  |  eval_value = 0.00000
iteration: 36  |  (x,y) = ([0.00118933 0.00119008])  |  eval_value = 0.00000
iteration: 37  |  (x,y) = ([0.00100572 0.00100635])  |  eval_value = 0.00000
iteration: 38  |  (x,y) = ([0.00084967 0.00085021])  |  eval_value = 0.00000
iteration: 39  |  (x,y) = ([0.00071717 0.00071763])  |  eval_value = 0.00000
iteration: 40  |  (x,y) = ([0.00060477 0.00060516])  |  eval_value = 0.00000
iteration: 41  |  (x,y) = ([0.00050951 0.00050984])  |  eval_value = 0.00000
iteration: 42  |  (x,y) = ([0.00042885 0.00042912])  |  eval_value = 0.00000
iteration: 43  |  (x,y) = ([0.00036062 0.00036085])  |  eval_value = 0.00000
iteration: 44  |  (x,y) = ([0.00030295 0.00030315])  |  eval_value = 0.00000
iteration: 45  |  (x,y) = ([0.00025426 0.00025443])  |  eval_value = 0.00000
iteration: 46  |  (x,y) = ([0.00021319 0.00021333])  |  eval_value = 0.00000
iteration: 47  |  (x,y) = ([0.00017858 0.0001787 ])  |  eval_value = 0.00000
iteration: 48  |  (x,y) = ([0.00014945 0.00014954])  |  eval_value = 0.00000
iteration: 49  |  (x,y) = ([0.00012494 0.00012502])  |  eval_value = 0.00000

In our example, we can see that the evaluated value has effectively reached the minimum (it rounds to 0.00000) by around the 35th iteration.

Visualizing RMSProp

We use a filled contour plot similar to the one from the test problem. This time, we add another plot on top of the contour to trace the solution found at each iteration; the list of all solutions is used to draw the trace.

x_axis = arange(values_range[0,0], values_range[0,1], 0.1)
y_axis = arange(values_range[1,0], values_range[1,1], 0.1)

x, y = meshgrid(x_axis, y_axis)
# calculate results
results = objective_function(x, y)
# create a filled contour plot
plt.contourf(x, y, results, levels=80, cmap='jet')

all_solution_values = asarray(solutions)
# plot to trace solution points with each iteration
plt.plot(all_solution_values[:, 0], all_solution_values[:, 1], '.-', color='w')
plt.show()

As a result, it produces the following plot.

The complete code for RMSProp is given below.

from math import sqrt
from numpy import asarray,arange,meshgrid
from numpy.random import rand, seed
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# target function is a simple second-degree equation
def objective_function(x, y):
	return x**2.0 + y**2.0

# function to find the derivative of target function x**2 +y**2
def derivative_function(x, y):
	return asarray([x * 2.0, y * 2.0])

# RMSProp algorithm
def RMSProp(objective_function, derivative_function, values_range, n_iterations, step_size, Beta):
	# list of all solution points
	all_solutions_list = list()
	# initial point generation within the range
	current_solution_point = values_range[:, 0] + rand(len(values_range)) * (values_range[:, 1] - values_range[:, 0])
	# squared gradients average
	squared_gradient_avg = [0.0 for _ in range(values_range.shape[0])]
	#Gradient Descent Algorithm looped n times
	for n in range(n_iterations):
		
		gradient = derivative_function(current_solution_point[0], current_solution_point[1])
		# loop to calculate the average of the squared gradients
		for i in range(gradient.shape[0]):
			squared_gradient = gradient[i]**2.0
			squared_gradient_avg[i] = (squared_gradient_avg[i] * Beta) + (squared_gradient * (1.0-Beta))
	 
		# update solution point with squared gradient average
		updated_solution = list()
		for j in range(current_solution_point.shape[0]):
			# Learning rate calculation
			alpha = step_size / (1e-8 + sqrt(squared_gradient_avg[j]))
			# update solution point
			updated_value = current_solution_point[j] - alpha * gradient[j]
			updated_solution.append(updated_value)
		# Store updated solution
		current_solution_point = asarray(updated_solution)
		all_solutions_list.append(current_solution_point)
		# value of the function at updated solution point
		solution_eval_value = objective_function(current_solution_point[0], current_solution_point[1])

		print('iteration: %d  |  (x,y) = (%s)  |  eval_value = %.5f' % (n, current_solution_point, solution_eval_value))
	return all_solutions_list

# seed random number generator with any value
seed(1)
# min and max bounds of the input
values_range = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# number of iterations
n_iterations = 50
# step size 
step_size = 0.01
# decay factor (beta) for rmsprop; 0.99 works for most use cases
Beta = 0.99
# RMSProp gradient Descent
solutions = RMSProp(objective_function, derivative_function, values_range, n_iterations, step_size, Beta)

x_axis = arange(values_range[0,0], values_range[0,1], 0.1)
y_axis = arange(values_range[1,0], values_range[1,1], 0.1)

x, y = meshgrid(x_axis, y_axis)
# calculate results
results = objective_function(x, y)
# create a filled contour plot
plt.contourf(x, y, results, levels=80, cmap='jet')

all_solution_values = asarray(solutions)
# plot to trace solution points with each iteration
plt.plot(all_solution_values[:, 0], all_solution_values[:, 1], '.-', color='w')
plt.show()
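
If you want to experiment further, the same RMSProp routine can be pointed at a different objective. The snippet below is meant to be appended after the complete listing; it uses a hypothetical, poorly scaled bowl (x**2 + 10*y**2) to show how the per-variable scaling adapts, and the function names are illustrative only.

# a hypothetical ill-conditioned objective to try the same routine on
def skewed_objective(x, y):
	return x**2.0 + 10.0 * y**2.0

def skewed_derivative(x, y):
	return asarray([2.0 * x, 20.0 * y])

seed(1)
skewed_solutions = RMSProp(skewed_objective, skewed_derivative, values_range,
	n_iterations, step_size, Beta)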

Conclusion:

  • RMSProp is an adaptive learning rate optimization algorithm.
  • RMSProp dampens oscillations and therefore reaches the minimum faster.
  • RMSProp acts as a moving-average filter that takes previous gradients into account when updating the learning rate.
  • RMSProp solves RProp's inability to work with mini-batches.