Getting Started with Tesseract-OCR

SongpingWang · Published 2018-11-14 17:31:17

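The following applies to Windows only; it has not been tested on Linux.

tesserocr and pytesseract are Python OCR libraries. Both are thin Python API wrappers around tesseract (pytesseract wraps Google's Tesseract-OCR engine), so tesseract is the core of each, and it must be installed before installing tesserocr or pytesseract.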

1. Installing Tesseract-OCR

Tesseract-OCR is open source under the Apache 2.0 license.
Official repository: https://github.com/tesseract-ocr/tesseract

Download: https://digi.bib.uni-mannheim.de/tesseract/
You can also build it from source: https://github.com/tesseract-ocr/tesseract/wiki/Downloads
Or use the unofficial installer: https://github.com/UB-Mannheim/tesseract/wiki

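On Windows, double-click the installer and click Next through each step.

During installation you can tick the Additional language data (download) option to install the language packs that OCR recognition supports. Downloading language packs through the installer is very slow, however. A faster route is to download the zip archive of language data from https://github.com/tesseract-ocr/tessdata, unzip it, and copy the files in tessdata-master into the tessdata directory of the Tesseract installation, C:\Program Files (x86)\Tesseract-OCR\tessdata.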
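The copy step can also be scripted. The following is a minimal sketch, not part of the original setup: the helper name copy_traineddata is hypothetical, and it assumes you only need the *.traineddata files from the unzipped archive rather than every file in it. Writing under C:\Program Files (x86) may require an elevated prompt.

```python
import shutil
from pathlib import Path


def copy_traineddata(src_dir, tessdata_dir):
    """Copy every *.traineddata file from src_dir into tessdata_dir.

    Returns the sorted list of file names that were copied.
    """
    src = Path(src_dir)
    dst = Path(tessdata_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create the target if it is missing
    copied = []
    for f in sorted(src.glob("*.traineddata")):
        shutil.copy2(f, dst / f.name)
        copied.append(f.name)
    return copied


# Example call; both paths are assumptions, adjust them to your machine:
# copy_traineddata(r"D:\tessdata-master",
#                  r"C:\Program Files (x86)\Tesseract-OCR\tessdata")
```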

Add the environment variable: add the installation directory C:\Program Files (x86)\Tesseract-OCR to the PATH environment variable.

At this step, make sure the Simplified Chinese language data (chi_sim) has been added.
Then change into the installation directory and run .\tesseract
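To confirm from Python that tesseract is reachable on PATH and that chi_sim is installed, something like the sketch below may help. The function names are hypothetical. It assumes the output of tesseract --list-langs is a header line followed by one language code per line, and that the list may arrive on stdout or stderr depending on the version.

```python
import shutil
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`.

    Assumption: language codes contain no spaces, while the header line
    ("List of available languages ...") does, so lines with spaces are dropped.
    """
    lines = [line.strip() for line in output.splitlines()]
    return [line for line in lines if line and " " not in line]


def installed_langs():
    """Return the installed language codes, or None if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Fall back to stderr when stdout is empty (assumed behavior of older versions).
    return parse_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    langs = installed_langs()
    if langs is None:
        print("tesseract was not found on PATH")
    else:
        print("chi_sim installed:", "chi_sim" in langs)
```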

2. Testing

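List the available languages: tesseract --list-langs
Run tesseract D:\example_05.jpg D:\out to recognize with the default language (English); the output is out.txt
Run tesseract D:\example_05.jpg D:\out -l eng to specify English explicitly; the output is out.txt
Run tesseract D:\example_05.jpg D:\out -l eng pdf to recognize in English and output out.pdf
Run tesseract --print-parameters to list all parameters
Use the -c option to set the value of a single parameter:
tesseract D:\example_05.jpg D:\out -l chi_sim -c language_model_ngram_on=1
Use multiple -c options to set several parameters.
To set many parameters at once, write them to a config file and pass that file when recognizing:
tesseract paper.png paper -l chi_sim tess.conf
Note that if you have two config files, tess_1.conf and tess_2.conf, pass both:
tesseract paper.png paper -l chi_sim tess_1.conf tess_2.conf
The commands above do produce output, but the recognition quality is poor; try it yourself.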
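The command patterns above all follow the same order: image, output base, -l language, -c parameters, then config names such as pdf or tess.conf. As an illustration only, the sketch below assembles that argument list in Python and runs it with subprocess; build_tesseract_cmd and run_tesseract are hypothetical helper names, not part of Tesseract or pytesseract.

```python
import subprocess


def build_tesseract_cmd(image, out_base, lang=None, params=None, configs=()):
    """Build the argument list for one tesseract run.

    image    -- input image path
    out_base -- output path without extension (tesseract appends .txt or .pdf)
    lang     -- language code passed to -l, e.g. "eng" or "chi_sim"
    params   -- dict of parameter name -> value, each emitted as its own -c option
    configs  -- config names or files appended last, e.g. "pdf" or "tess.conf"
    """
    cmd = ["tesseract", str(image), str(out_base)]
    if lang:
        cmd += ["-l", lang]
    for name, value in (params or {}).items():
        cmd += ["-c", f"{name}={value}"]
    cmd += list(configs)
    return cmd


def run_tesseract(*args, **kwargs):
    """Run tesseract with the built arguments; raises CalledProcessError on failure."""
    return subprocess.run(build_tesseract_cmd(*args, **kwargs), check=True)


if __name__ == "__main__":
    # Prints the command that would be executed, without running it.
    print(build_tesseract_cmd(r"D:\example_05.jpg", r"D:\out",
                              lang="eng", configs=["pdf"]))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces, such as C:\Program Files (x86).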

3. Usage

Install the pytesseract module from a whl file, with conda, or with pip:
pip install pytesseract

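If pytesseract cannot find the tesseract executable at run time (this usually happens inside a virtual environment), either add the directory containing tesseract.exe to the Windows PATH environment variable, or edit pytesseract.py and set the tesseract_cmd field to the full path of tesseract.exe.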

import pytesseract
from PIL import Image


def main():
    image = Image.open("test01.png")
    # Recognize the image using the Simplified Chinese language data
    text = pytesseract.image_to_string(image, lang='chi_sim')
    print(text)
    # Save the recognized text locally; UTF-8 keeps the Chinese characters intact
    with open("output.txt", "w", encoding="utf-8") as f:
        f.write(text)


if __name__ == '__main__':
    main()

Error: pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your path

In pytesseract.py, find:

# CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
tesseract_cmd = 'tesseract'

Change it to:
tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'

Fix for this error: https://blog.csdn.net/wang_hugh/article/details/80760940
Alternatively, set the path directly in your own code:

import pytesseract
from PIL import Image


def picture_win(jpg_path):
    # Point pytesseract at tesseract.exe; adjust this to your own installation path
    pytesseract.pytesseract.tesseract_cmd = r'D:/Program Files (x86)/Tesseract/tesseract.exe'
    p = Image.open(jpg_path)
    text = pytesseract.image_to_string(p, lang="chi_sim")
    # Strip newlines and spaces from the recognized text
    return text.replace("\n", "").replace(" ", "")


if __name__ == '__main__':
    print(picture_win("test01.png"))
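Hard-coding one path works on a single machine but breaks as soon as the install location differs. One possible alternative, sketched here with the hypothetical helper resolve_tesseract_cmd, is to prefer whatever is on PATH and fall back to a list of candidate locations. The candidate paths in the usage comment are only the two that appear in this article.

```python
import os
import shutil


def resolve_tesseract_cmd(candidates=()):
    """Return a usable tesseract executable path, or None if none is found.

    PATH is checked first; the candidate paths are then tried in order.
    """
    found = shutil.which("tesseract")
    if found:
        return found
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


# Usage with pytesseract (paths are assumptions, adjust to your machine):
# cmd = resolve_tesseract_cmd([
#     r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe',
#     r'D:/Program Files (x86)/Tesseract/tesseract.exe',
# ])
# if cmd:
#     pytesseract.pytesseract.tesseract_cmd = cmd
```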

Acknowledgements
https://blog.csdn.net/haluoluo211/article/details/53304900
https://www.cnblogs.com/zhangxinqi/p/9297292.html
