开源库openhtmltopdf html 转pdf特殊字符适配

HTML转PDF特殊字符

漫漫蜗牛路

2804人浏览 · 2022-01-21 11:11:06

漫漫蜗牛路 · 2022-01-21 11:11:06 发布

概述

html 转pdf 是常规需求，也是一个比较让人烦躁的需求，可能出现各种不讲道理的问题。本篇主要分析html 中存在特殊字符时，如何正常转化为pdf 的问题。其他相关注意事项可参考我的其他文章,如下。

基本的`html` 转`pdf`。

这里介绍使用开源库openhtmltopdf 将html 转化为pdf。更加详细的文档可参考openhtmltopdf库操作手册。

import java.io.FileOutputStream;
import java.io.OutputStream;
import com.openhtmltopdf.pdfboxout.PdfRendererBuilder;

public class SimpleUsage 
{
    public static void main(String[] args) throws Exception { 
        try (OutputStream os = new FileOutputStream("/Users/me/Documents/pdf/out.pdf")) {
            PdfRendererBuilder builder = new PdfRendererBuilder();
            builder.useFastMode();
            builder.withUri("file:///Users/me/Documents/pdf/in.htm");
            builder.toStream(os);
            builder.run();
        }
    }
}

存在问题

通过以上代码即可把html 转化为pdf，但如果html中有特殊字符如&，则会转化失败。因为默认的严格的xmldoc，对于格式有严格要求。遇到特殊字符必须使用转义字符，否则就会因无法识别而报错。

解决方案

第一步：将html 转化为标准的`w3c DOM`，代码如下。

	public org.w3c.dom.Document html5ParseDocument(String urlStr, int timeoutMs) throws IOException 
	{
		URL url = new URL(urlStr);
		org.jsoup.nodes.Document doc;
		
		if (url.getProtocol().equalsIgnoreCase("file")) {
			doc = Jsoup.parse(new File(url.getPath()), "UTF-8");
		}
		else {
			doc = Jsoup.parse(url, timeoutMs);	
		}
		// Should reuse W3CDom instance if converting multiple documents.
		return new W3CDom().fromJsoup(doc);
	}

**注意：**调用方法html5ParseDocument（）需要传入两个参数，第一个参数是html 的文件路径或地址，如果html文件是本地文件，需要在文件路径前添加file://。如file:///user/temp/test.html。第二个参数是解析远程html 的超时时间，没特殊要求。

第二步：将 `w3c DOM` 转化为`pdf`

    public static void getPdfByHtml(String htmlPath, String pdfPath) {
        OutputStream os =null;
        try  {
            os = new FileOutputStream(pdfPath);
            PdfRendererBuilder builder = new PdfRendererBuilder();
            builder.useFastMode();
            org.w3c.dom.Document document = html5ParseDocument("file://" + htmlPath, 2000);
            builder.withW3cDocument(document, "file:///");
            builder.toStream(os);
            builder.run();
        } catch (FileNotFoundException e) {
            logger.error("html文件未找到");
            e.printStackTrace();
        } catch (IOException e) {
            logger.error("html转PDF文件 出错");
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }finally {
            try {
                os.flush();
                os.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

注意：builder.withW3cDocument(document, "file:///");这个方法的第一个参数就是我们转化后的符合w3c DOM标准的doc文件的实例，没啥好说的。第二个参数是baseUri，用来描述html文件的根目录。假设html 中有个本地图片资源：/Documents/temple/Downloads/histogram_8a754ed63131432f8a720a4db6e7ab1a_ldzc.png，那么你最少需要将根目录设置为/，也就是将baseUri设置为"file:///"。当然这个根目录必须是html 中所有资源的根目录，否则可能出现无法找到资源文件的问题。

AtomGit 开源协作平台测评赛

瓜分20万奖金获得内推名额丰厚实物奖励易参与易上手

更多推荐

白盒测试体系的探索 [ 光影人像东海陈光剑的博客 ]

什么是白盒测试？很多人都听说过白盒测试。通常的说法是，白盒测试是能看到全部产品源代码的测试。常常，白盒测试都是和牛人绑在一起的。个人认为这是一种比较狭隘的说法。然而究竟什么是白盒测试呢？可能有很多的人在做了很长时间的白盒测试以后发现，自己其实不是在做白盒测试，而是在做灰盒测试，原因是不够“白”，因为他没有看到全部的产品代码。其实，我个人认为，这是做白盒测试的误区。从广义来说，个人认为，白...

开放原子开发者工作坊

现代程序设计 homework-01

搞了6个小时individual project...看看博客做一做第一次现代程序设计作业1) 建立 GitHub 账户, 把课上做的 “最大子数组之和” 程序签入我的github地址是https://github.com/oldoldb,以前没有用过各种不熟练啊....代码我放到<现代程序设计课后作业>这个repository里了,是昨天完成的在hdu和poj上找的5道关于一...

开放原子开发者工作坊

精心收集的 48 个 JavaScript 代码片段，仅需 30 秒就可理解

原文：Chalarangelo 译文：IT168https://github.com/Chalarangelo/30-seconds-of-code#anagrams-of-string-with-duplicates该项目来自于 Github 用户 Chalarangelo，目前已在 Github 上获得了 5000 多Star，精心收集了多达 48 个有用的 JavaS...