Cause of the Error and Solutions

Error message:

Allocator (GPU_0_bfc) ran out of memory trying to allocate 200.00MiB (rounded to 209715200).  Current allocation summary follows.
<omitted>
Resource exhausted: OOM when allocating tensor with shape[51200,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc.
<omitted>
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[51200,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node word/word/bi-lstm-0/bi-lstm-0/bw/bw/while/lstm_cell/MatMul}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[add/_77]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[51200,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node word/word/bi-lstm-0/bi-lstm-0/bw/bw/while/lstm_cell/MatMul}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

The error means the GPU has run out of memory. You can monitor GPU memory usage during a run with the command `watch -n 1 nvidia-smi` (exit with Ctrl+C). Note that it is the middle column that shows memory usage ("allocated memory / total available memory"); the rightmost column is GPU utilization, which is a different thing (in the same way that memory usage and CPU utilization are two different things).


There are several solutions; pick whichever fits your situation:

  1. If you have not yet allowed TensorFlow to grow its GPU memory allocation on demand, try that first and see whether using all of a single GPU's memory solves the problem. The code to enable memory growth is given below.
  2. If the network contains an RNN, pass the `swap_memory=True` option to reduce its GPU memory footprint; the `tf.nn.bidirectional_dynamic_rnn()` method, for example, accepts this parameter. With it enabled, TensorFlow moves tensors that are produced in the RNN's forward pass but needed only during backpropagation from the GPU to the CPU (from GPU memory to host RAM), with little (or even no) performance penalty.
  3. Reduce the batch_size, or reduce the RNN's maximum sequence length (i.e., the number of time steps).
  4. Switch to a GPU with more memory.
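
Solution 2 can be sketched as follows. This is a minimal example with hypothetical input sizes, written against the TF1-style API (via `tf.compat.v1` so it also runs on TensorFlow 2.x); the only change needed in existing code is the `swap_memory=True` argument:

```python
import tensorflow.compat.v1 as tf  # on TensorFlow 1.x, just `import tensorflow as tf`
tf.disable_eager_execution()

# Hypothetical sizes, for illustration only
time_steps, input_dim, hidden_units = 50, 128, 256

inputs = tf.placeholder(tf.float32, [None, time_steps, input_dim])
cell_fw = tf.nn.rnn_cell.LSTMCell(hidden_units)
cell_bw = tf.nn.rnn_cell.LSTMCell(hidden_units)

# swap_memory=True lets TensorFlow offload forward-pass tensors that are
# only needed for backprop from GPU memory to host RAM.
(output_fw, output_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, inputs, swap_memory=True, dtype=tf.float32)
```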

P.S. Code to enable GPU memory growth:

    import tensorflow as tf

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)
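
The snippet above targets TensorFlow 1.x. On TensorFlow 2.x, where `ConfigProto` and `Session` are gone, the equivalent setting is `tf.config.experimental.set_memory_growth`, which must be called before any GPU is initialized:

```python
import tensorflow as tf

# TF 2.x: enable on-demand GPU memory growth per physical GPU.
# Call this before any op has touched the GPU.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```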

References

  1. Stack Overflow question: How can I solve 'ran out of gpu memory' in TensorFlow
  2. TensorFlow documentation: bidirectional_dynamic_rnn