PDFBox bug: excessive memory usage

爱上口袋的天空 · Published 2023-08-19 21:10:32

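1. Introduction

2. Cause

3. Solutions

3.1 Keep the PDF in temporary files on disk to reduce memory use

3.2 Customize DefaultResourceCache

3.3 Allow subsampling

3.4 Use the ImageIO utility class with compression and lower the resolution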

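1. Introduction

When PDFBox converts a PDF of more than 80 pages to images, it uses a lot of memory. The JVM runs garbage collection several times along the way, but it does not help.

I recently noticed very high memory usage while importing PDFs with PDFBox. According to the official documentation, PDFBox needs a lot of memory for some complex PDFs. This cannot be avoided, and with a small heap an OutOfMemoryError is likely. PDFBox does, however, offer several ways to reduce memory use. Rather than a full implementation, I list them below along several dimensions, as a reference for anyone tuning PDFBox later.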

2. Cause

I later found that the official PDFBox FAQ already gives the remedies:

I’m getting an OutOfMemoryError. What can I do?
The memory footprint depends on the PDF itself and on the resolution you use for rendering. Some possible options:

increase the -Xmx value when starting java
use a scratch file by loading files with this code PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly())
be careful not to hold your images after rendering them, e.g. avoid putting all images of a PDF into a List
don’t forget to close your PDDocument objects
decrease the scale when calling PDFRenderer.renderImage(), or the dpi value when calling PDFRenderer.renderImageWithDPI()
disable the cache for PDImageXObject objects by calling PDDocument.setResourceCache() with a cache object that is derived from DefaultResourceCache and whose call public void put(COSObject indirect, PDXObject xobject) does nothing. Be aware that this will slow down rendering for PDF files that have an identical image in several pages (e.g. a company logo or a background). More about this can be read in PDFBOX-3700.

The report in PDFBOX-3700 says, in short, that a heap dump showed the decoded images held in DefaultResourceCache to be the cause of the out-of-memory error:

I am using PDFBox to convert PDF documents to a series of TIFF images (one for each page). The implementation uses PDFRenderer to render each page. Things work fine when I am processing a single document in a single thread, however when I try to process multiple documents (each in its own thread) I get an OutOfMemoryException.

In analyzing the heap dump, I see that this is caused by the images cached in DefaultResourceCache. Objects are added to the cache in PDResources, which includes a method private boolean isAllowedCache(PDXObject xobject) that is used to determine whether an PDXObject can be cached. I have extended this to filter out COSName.IMAGE, and am now able to process multiple documents in parallel.

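Further down in the issue, Tilman Hausherr also examined a heap dump. He found the remaining image objects reachable only through SoftReference and not being reclaimed, without pinning down a cause in PDFBox itself: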

I created a dump and had a look at it.

No PDXObjectImage objects found in class list.

By looking at SoftReference objects:
7 instances PDXObjectImage but not referenced from elsewhere, 3 BufferedImage instances

biggest in dump is int[], 6 instances, it’s from BufferedImage

15 instances of BufferedImage
14 instances of BufImgSurfaceData
13 instances of BufImgSurfaceManager

Sadly all this fails to find a cause in PDFBox itself, i.e. that we hold a PDXObjectImage or a BufferedImage for too long thus preventing the SoftReference’d objects to be recovered by gc.

The put method in DefaultResourceCache.java uses SoftReference:

@Override
public void put(COSObject indirect, PDXObject xobject) throws IOException {
	xobjects.put(indirect, new SoftReference<PDXObject>(xobject));
}

3. Solutions

3.1 Keep the PDF in temporary files on disk to reduce memory use

PDDocument doc = PDDocument.load(stream, MemoryUsageSetting.setupTempFileOnly());
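The scratch-file option combines naturally with two other FAQ items: closing the document and not keeping rendered images around. The following is a minimal sketch of that combination, assuming PDFBox 2.x (where PDDocument.load exists). The class name PageByPageRenderer, the method renderToDirectory, and the output file naming are made up for illustration. Images are written with the JDK's own ImageIO.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

// Hypothetical helper, not part of PDFBox.
final class PageByPageRenderer {

    // Renders every page of the PDF to outDir as PNG and returns the page count.
    static int renderToDirectory(File pdf, File outDir, float dpi) throws IOException {
        // try-with-resources closes the document (and its scratch file) even on failure.
        try (PDDocument doc = PDDocument.load(pdf, MemoryUsageSetting.setupTempFileOnly())) {
            PDFRenderer renderer = new PDFRenderer(doc);
            int pages = doc.getNumberOfPages();
            for (int i = 0; i < pages; i++) {
                BufferedImage image = renderer.renderImageWithDPI(i, dpi);
                File target = new File(outDir, "page-" + (i + 1) + ".png");
                // Write the page out immediately instead of collecting images in a List.
                if (!ImageIO.write(image, "png", target)) {
                    throw new IOException("No PNG writer available for " + target);
                }
                // Release resources held by the image before rendering the next page.
                image.flush();
            }
            return pages;
        }
    }
}
```

Only one rendered page is referenced at a time, so peak memory is driven by the largest single page rather than by the page count.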

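3.2 Customize DefaultResourceCache

The default implementation stores entries through soft references, as in the put method shown above. The JVM clears soft references only when memory is about to run out, so cached objects can stay in memory for the whole life of the document.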

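We can write a subclass whose put method does nothing, or one that stores entries through a weaker reference type such as WeakReference (a PhantomReference would not work as a cache, since its get() always returns null).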

Pass an instance of the new subclass to PDDocument.setResourceCache() to resolve the memory problem.
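A minimal sketch of such a subclass, following the FAQ's description of a put(COSObject, PDXObject) override that does nothing. The class name NoXObjectResourceCache is made up for illustration. The override omits the throws clause so that it compiles whether or not the parent method declares IOException.

```java
import org.apache.pdfbox.cos.COSObject;
import org.apache.pdfbox.pdmodel.DefaultResourceCache;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;

// Hypothetical subclass, not part of PDFBox: XObjects (images and forms) are never cached.
final class NoXObjectResourceCache extends DefaultResourceCache {

    @Override
    public void put(COSObject indirect, PDXObject xobject) {
        // Intentionally empty: nothing is stored, so no decoded image is retained.
        // Trade-off noted in the FAQ: an image repeated on many pages is decoded each time.
    }
}
```

It would be installed once per document, before rendering, with doc.setResourceCache(new NoXObjectResourceCache()). Other resource types such as fonts still go through the inherited caching.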

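3.3 Allow subsampling

Subsampling shrinks an image, mainly so that it fits the display area or serves as a thumbnail, so it also reduces memory use. In PDFRenderer the setting is off by default and is meant for output rendered at low resolution.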

PDFRenderer renderer = new PDFRenderer(doc);
renderer.setSubsamplingAllowed(true);

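3.4 Use the ImageIO utility class with compression and lower the resolution

ImageIOUtil is in the org.apache.pdfbox.tools.imageio package.

// 100 is the DPI: a lower value uses less memory but also lowers the resolution.
// Choose it per use case; too low a value gives a low-quality image.
BufferedImage image = pdfRenderer.renderImageWithDPI(i, 100);
ByteArrayOutputStream bs = new ByteArrayOutputStream();
ImageIOUtil.writeImage(image, "png", bs);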
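To see why the DPI value matters so much, here is a rough back-of-the-envelope estimate of the pixel buffer for one rendered page. It assumes the page size is given in points (1/72 inch), that pixel dimensions are rounded down, and that the image uses 4 bytes per pixel (an int-based RGB buffer). The class and method names are hypothetical, and temporary buffers used during rendering are ignored.

```java
// Hypothetical helper for estimating the memory of one rendered page.
final class RenderMemoryEstimate {
    private static final double POINTS_PER_INCH = 72.0;
    private static final int BYTES_PER_PIXEL = 4; // assumption: one int per pixel

    static long estimateBytes(double widthPt, double heightPt, double dpi) {
        double scale = dpi / POINTS_PER_INCH;
        long widthPx = Math.max((long) Math.floor(widthPt * scale), 1L);
        long heightPx = Math.max((long) Math.floor(heightPt * scale), 1L);
        return widthPx * heightPx * BYTES_PER_PIXEL;
    }

    public static void main(String[] args) {
        // A4 is roughly 595 x 842 points.
        System.out.println(estimateBytes(595, 842, 100)); // 3862376  (about 3.7 MiB)
        System.out.println(estimateBytes(595, 842, 300)); // 34785328 (about 33 MiB)
    }
}
```

Memory grows with the square of the DPI, so going from 300 to 100 DPI cuts the buffer to roughly one ninth.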
