OCR with Pytesseract: End-to-End Text Recognition (with source code and detailed explanations)

Note: This post describes in detail how to do OCR with Pytesseract, covering basic usage, image preprocessing, angle detection, image rotation, and more. Extensively commented code is attached; interested readers can download it. The code was run on Ubuntu (AMD).
If you want to do OCR with Pytesseract on top of classical image processing, this article should be all you need. It is fairly long; if you find it useful, consider liking or bookmarking it.

Code link: https://download.csdn.net/download/zyctimes/75141208



1. Setup

1.1 Installing and running the code

First, after downloading the code, create a virtual environment: virtualenv [venv], where [venv] is whatever name you pick for the environment. Then activate it: source [venv]/bin/activate. If the environment is freshly created, you also need to install the dependencies, which come in two parts. The first part is tesseract-ocr itself; I could not find a way to install it through pip, so run the following in a terminal:

sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
sudo apt install -y tesseract-ocr

That should complete the installation. You can verify the version with:

tesseract --version

The remaining packages can be installed by simply running: pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
Then run python main.py (from the project root) to see the results.
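
Before diving into the project, a quick sanity check that tesseract and the Python bindings are wired up correctly (a minimal sketch; imgRaw/invoice.png is a hypothetical file name, use any image under imgRaw/):

import cv2
import pytesseract

print(pytesseract.get_tesseract_version())        # should print the installed tesseract version
img = cv2.imread("imgRaw/invoice.png")            # OpenCV loads images in BGR order
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)    # tesseract expects RGB
print(pytesseract.image_to_string(img_rgb))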

1.2 Code structure

As usual, we start with the code layout and a brief explanation of each file.

--pytesseract-end2end
	|---imgRaw (test photos, used to check the results of our trained model)
	|---imgSave (directory where output images are saved)
	|---main.py (main program)
	|---mathe.py (math/geometry helper functions)
	|---modeltrain.py (model training)
	|---modeltest.py (model testing)
	|---configInputsParse.py (parses the parameters in config.txt)
	|---config.txt (all tunable parameters)
	|---requirements.txt (all packages and their versions)
	|---imgdenoiser.pickle (ML model)
	|---readme.md (documentation)

1.3 Code framework

Let us first look at the main program:

ocrOptions = ""
# 1 - Run tesseract directly and see the results.
imgOcr.mostSimpleOCR(img, ocrOptions)
imgOcr.imgOcrReset()

# 2 - Add digits to recog, whitelist, blacklist into ocr Options.
if cfgParameters.detectDigit:
    ocrOptions += "outputbase digits"
# whitelist : a string of characters that are allowed to pass through to the results
if len(cfgParameters.whitelist) != 0:
    ocrOptions += " -c tessedit_char_whitelist={} ".format(cfgParameters.whitelist)
# blacklist : Characters that must never be included in the results
if len(cfgParameters.blacklist) != 0:
    ocrOptions += " -c tessedit_char_blacklist={} ".format(cfgParameters.blacklist)  
imgOcr.mostSimpleOCR(img, ocrOptions)
imgOcr.imgOcrReset()

# 3 - Add translation module after characters detected. This translate function needs to connect to Google server, thus it may not work.
ocrOptions = ""
imgOcr.mostSimpleOCR(img, ocrOptions)
textFinalDetect = ';'.join(imgOcr.textFinalDetect)
print("textFinalDetect: ", textFinalDetect)
textFinalDetectTrans = TextBlob(textFinalDetect)
translated = textFinalDetectTrans.translate(to=cfgParameters.language2Translate)
print("Label message after translation from {} to {}.".format(cfgParameters.language2Detect, cfgParameters.language2Translate))
imgOcr.imgOcrReset()

# 4 - More complex OCR recognition in real life.
imgOcr.imgPreProcess(img)
imgOcr.imgLabelCorner()
imgOcr.imgRotateWarpLabel()
imgOcr.imgOcrDetection()
imgOcr.imgOcrReset()

As the comments indicate, the program consists of four modules:

  • run pytesseract directly;
  • filter the OCR results with pytesseract's whitelist/blacklist options;
  • translate the detected text;
  • a more involved end-to-end OCR pipeline.

2. Running pytesseract directly

In main.py:

ocrOptions = ""
# 1 - Run tesseract directly and see the results.
imgOcr.mostSimpleOCR(img, ocrOptions)
imgOcr.imgOcrReset()

Let us look at the mostSimpleOCR function:

def mostSimpleOCR(self,imgInput, options):
    '''
    This function shows the simplest way of using pytesseract.
    The input image is in BGR, so we first convert it to RGB,
    then run it directly through the pytesseract.image_to_data OCR function.
    The options argument here is an empty string.
    You will first see the raw image, then press any key on keyboard.
    Finally you will see ocr results image, with text detected printed on terminal.
    '''
    imgInput = cv2.cvtColor(imgInput, cv2.COLOR_BGR2RGB)
    
    # Print/Display the raw image if necessary
    self.printRawImg(imgInput)
    # Recognize the characters inside image, and show it on the image, if necessary.
    self.finalOcrDetection(imgInput, options)
    print("OCR Loop Done.")
    print("Text detected: ", self.textFinalDetect)

def finalOcrDetection(self, img2Detect, options):
    '''
    Recognize the characters inside image.
    Display the characters detected on the image, if necessary.
    Save image if necessary.
    Save the characters detected if necessary.
    '''
    self.ocrResultDict = {}     # Initialization
    self.textFinalDetect = []
    self.imgOcrResult = img2Detect
    # OCR Function
    self.ocrResultDict = pytesseract.image_to_data(img2Detect,lang=self.language2Detect,config=options, output_type=Output.DICT)
    print(pytesseract.get_languages(config=""))    # list the languages this tesseract install supports
    print("Full image_to_data output (inspect it to understand the loop below): ", self.ocrResultDict)
    for i in range(len(self.ocrResultDict['text'])):
        if float(self.ocrResultDict['conf'][i]) > 20: 
            (x, y, w, h) = (self.ocrResultDict['left'][i], self.ocrResultDict['top'][i], self.ocrResultDict['width'][i], self.ocrResultDict['height'][i]) 
            self.imgOcrResult = cv2.rectangle(img2Detect, (x, y), (x + w, y + h), (0, 255, 0), 2) 
            self.textFinalDetect.append(self.ocrResultDict['text'][i])
    if self.imgDisplay == True:
        # Display image with characters detected.
        cv2.namedWindow("Image final",cv2.WINDOW_NORMAL)
        cv2.setWindowProperty("Image final", cv2.WND_PROP_FULLSCREEN, cv2.WINDOW_FULLSCREEN)
        cv2.imshow("Image final", self.imgOcrResult)
        cv2.waitKey(0)
        cv2.destroyAllWindows()
    # Save image if necessary.
    if self.imgSave == True:
        cv2.imwrite('{0}/ImageOcrResultsFinal.png'.format(self.imgSaveFullPath),self.imgOcrResult)
        print("Image with charcters detected saved.")
    # Save the characters detected if necessary.
    if self.txtResultSave == True:
        with open('ocrResult.txt','w') as f:
            f.writelines('\n'.join(self.textFinalDetect))
        print("Text with charcters detected saved in ocrResult.txt file.")

The logic is straightforward: the input image is first converted from BGR to RGB, then OCR runs through the finalOcrDetection function. The core is the pytesseract.image_to_data call. Note that in config.txt you can freely choose whether to display or save the image at each step.
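
To make sense of the loop in finalOcrDetection, it helps to inspect the dictionary image_to_data returns. With Output.DICT it holds parallel lists, one entry per detected token (a sketch, reusing the hypothetical image path from section 1.1):

import cv2
import pytesseract
from pytesseract import Output

img_rgb = cv2.cvtColor(cv2.imread("imgRaw/invoice.png"), cv2.COLOR_BGR2RGB)
d = pytesseract.image_to_data(img_rgb, output_type=Output.DICT)
print(list(d.keys()))
# ['level', 'page_num', 'block_num', 'par_num', 'line_num', 'word_num',
#  'left', 'top', 'width', 'height', 'conf', 'text']
# d['text'][i] is the i-th token, d['conf'][i] its confidence (-1 for purely
# structural rows), and d['left'][i], d['top'][i], d['width'][i],
# d['height'][i] its bounding box: exactly the fields used in the loop above.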

Below is the recognition result on a near-ideal photo:

[Figure: OCR result on a clean invoice image]
Detected text: ['Invoice', 'Number', '1785439', 'Issue', 'Date', '2020-04-08', 'Due', 'Date', '2020-05-08', '|', 'DUE', '|', '$210.07', ' ', ' ', '']

Below is the OCR result on a photo much closer to real-world conditions:

[Figure: OCR result on a real-world receipt photo]
Detected text: ['1', '-', 'wesTPo', '903)', '227-6858', 'yHOLE', 'FOODS', '3g9', 'post', 'RD', 'WEST', '~', ';', '']

It is not hard to see that if we do nothing and simply feed the photo in, a few wrinkles or a skewed shooting angle are enough to throw the recognition off.

3. Restricting the recognized characters

Pytesseract also supports restricting the output to digits only, and whitelisting or blacklisting specific characters. In my experience this is rarely needed, so I will not dwell on it; the corresponding code is below (with an illustrative option string after it) for anyone interested.

'''
# 1 - Run tesseract directly and see the results.
imgOcr.mostSimpleOCR(img, ocrOptions)
imgOcr.imgOcrReset()
'''
# 2 - Add digits to recog, whitelist, blacklist into ocr Options.
if cfgParameters.detectDigit:
    ocrOptions += "outputbase digits"
# whitelist : a string of characters that are allowed to pass through to the results
if len(cfgParameters.whitelist) != 0:
    ocrOptions += " -c tessedit_char_whitelist={} ".format(cfgParameters.whitelist)
# blacklist : Characters that must never be included in the results
if len(cfgParameters.blacklist) != 0:
    ocrOptions += " -c tessedit_char_blacklist={} ".format(cfgParameters.blacklist)  
imgOcr.mostSimpleOCR(img, ocrOptions)
imgOcr.imgOcrReset()
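
For reference, if detectDigit is enabled and whitelist is, say, "0123456789$." (an illustrative value, not from the project config), the assembled string and an equivalent direct call look like this:

# Illustrative result of the option assembly above:
ocrOptions = "outputbase digits -c tessedit_char_whitelist=0123456789$. "
# "outputbase digits" loads tesseract's digits config file, and
# tessedit_char_whitelist restricts which characters may appear in the output.
text = pytesseract.image_to_string(img_rgb, config="-c tessedit_char_whitelist=0123456789$.")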

4. End-to-end OCR

# 4 - More complex OCR recognition in real life.
imgOcr.imgPreProcess(img)
imgOcr.imgLabelCorner()
imgOcr.imgRotateWarpLabel()
imgOcr.imgOcrDetection()
imgOcr.imgOcrReset()

4.1 Preprocessing

def imgPreProcess(self, imgInput):
    '''
    After loading the image, resize it, denoise by blurring, get the grey image, and get the edge by canny operator.
    '''
    self.imgInputResize = imutils.resize(imgInput, width=720)  
    gray = cv2.cvtColor(self.imgInputResize, cv2.COLOR_BGR2GRAY)
    self.greyImgInput=cv2.GaussianBlur(gray, (5,5), 0)
    self.ImgCannyEdge = cv2.Canny(self.greyImgInput,75,200)
    if self.imgDisplay == True:
        # Display raw image
        cv2.namedWindow("Raw Image",cv2.WINDOW_NORMAL)
        cv2.setWindowProperty("Raw Image", cv2.WND_PROP_FULLSCREEN, cv2.WINDOW_FULLSCREEN)
        cv2.imshow('Raw Image',self.imgInputResize)
        cv2.waitKey(0)
        cv2.destroyAllWindows()
        # Display Image with edge
        cv2.namedWindow("Image Edge",cv2.WINDOW_NORMAL)
        cv2.setWindowProperty("Image Edge", cv2.WND_PROP_FULLSCREEN, cv2.WINDOW_FULLSCREEN)
        cv2.imshow('Image Edge',self.ImgCannyEdge)
        cv2.waitKey(0)
        cv2.destroyAllWindows()
    if self.imgSave == True:
        cv2.imwrite('{0}/1_rawImage.png'.format(self.imgSaveFullPath),self.imgInputResize)
        cv2.imwrite('{0}/2_rawImageCanny.png'.format(self.imgSaveFullPath),self.ImgCannyEdge)
    print("Image Pre-process Done. Output: Image edge with canny operator. ")

This step covers resizing (imutils.resize), grayscale conversion (COLOR_BGR2GRAY), denoising by Gaussian blur, and edge extraction with the Canny operator. Indeed, classical image processing involves many hand-tuned parameters, and different values genuinely affect the final result; after writing this code and repeatedly re-tuning it for this post, I can attest to that firsthand.
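
Since the Canny thresholds are exactly the kind of hand-tuned parameter just mentioned, a quick sweep like the following makes their effect easy to eyeball (a sketch; the thresholds and file name are illustrative):

import cv2
import imutils

img = imutils.resize(cv2.imread("imgRaw/label.png"), width=720)
gray = cv2.GaussianBlur(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), (5, 5), 0)
for lo, hi in [(30, 100), (75, 200), (120, 250)]:
    # Lower thresholds keep more (noisier) edges; higher ones keep only strong edges.
    cv2.imwrite("canny_{}_{}.png".format(lo, hi), cv2.Canny(gray, lo, hi))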

Original image:
[Figure: original label photo]
After Canny edge extraction:

[Figure: Canny edge map]

4.2 Finding the label contour

The second step is to extract the four corner points of the label in the image above, in preparation for rotating it.

def imgLabelCorner(self):
    '''
    Loop over the contours, approximate each one to a polygon, and keep the
    first four-point polygon, which we assume to be the label outline.
    "arcLength" computes the perimeter of a contour; if the second argument is
    True the contour is treated as closed.
    "peri" feeds the epsilon value of "approxPolyDP", the precision factor for
    approximating and smoothing a shape.
    "approxPolyDP" works well where the contour has sharp corners, like a
    document boundary.
    '''
    cnts = cv2.findContours(self.ImgCannyEdge, cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_SIMPLE)
    # RETR_EXTERNAL: retrieves only the extreme outer contours.
    # RETR_TREE: retrieves all of the contours and reconstructs a full hierarchy of nested contours.
    # CHAIN_APPROX_SIMPLE: compresses horizontal, vertical, and diagonal segments and leaves only their end points.
    cnts = imutils.grab_contours(cnts)
    cnts = sorted(cnts, key=cv2.contourArea, reverse=True)   # Sort the contour (points) in descending order by contour size.
    labelOutlineCnts = []        # Contours that are four-point (roughly rectangular) shapes
    for c in cnts:
        peri = cv2.arcLength(c, True)  
        approx = cv2.approxPolyDP(c, 0.02 * peri, True)
        # if our approximated contour has four points, then we can
        # assume we have found the outline of the label.
        if len(approx) == 4:
            labelOutlineCnts.append(approx)
            #print(labelOutlineCnts)
            break
    # If the label contour is empty then our script could not find the outline of the label so raise an error
    # Else if we get more than one set of four-point contours, then we select one with biggest area.
    if len(labelOutlineCnts) == 0:
        raise Exception(("Could not find label outline. "
                        "Try debugging your thresholding and contour steps."))
    elif len(labelOutlineCnts) > 1:
        labelOutlineCnts = sorted(labelOutlineCnts, key=cv2.contourArea, reverse=True)
        self.labelOutlineCnt = labelOutlineCnts[0]
    elif len(labelOutlineCnts) == 1:
        self.labelOutlineCnt = labelOutlineCnts[0]
    cv2.drawContours(self.imgInputResize, [self.labelOutlineCnt], -1, (0, 255, 0), 2)
    if self.imgDisplay == True:
        # draw the contour of the label on the image
        cv2.namedWindow("Label Outline",cv2.WINDOW_NORMAL)
        cv2.setWindowProperty("Label Outline", cv2.WND_PROP_FULLSCREEN, cv2.WINDOW_FULLSCREEN)
        cv2.imshow("Label Outline", self.imgInputResize)
        cv2.waitKey(0)
        cv2.destroyAllWindows()
    if self.imgSave == True:
        cv2.imwrite('{0}/3_ImageContours.png'.format(self.imgSaveFullPath),self.imgInputResize)
    print("Image label corner extraction done. Output: label corners: {0}.".format(self.labelOutlineCnt))

The core of this function is cv2.findContours together with cv2.approxPolyDP. With epsilon set to 0.02 * peri, the approximated polygon may deviate from the original contour by up to 2% of its perimeter, which is loose enough to collapse a slightly wavy label boundary into exactly four corner points. One compatibility detail behind the imutils.grab_contours call is also worth a quick sketch.
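
A version-agnostic equivalent (a sketch, not the original code; edge_map stands in for self.ImgCannyEdge):

# cv2.findContours returns (contours, hierarchy) in OpenCV 2.x/4.x but
# (image, contours, hierarchy) in 3.x; imutils.grab_contours picks out the
# contours element either way. Without imutils:
res = cv2.findContours(edge_map, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = res[0] if len(res) == 2 else res[1]

The detection result: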

[Figure: label contour drawn on the image]

4.3 Rotating the label

The third step is rotation. Having found all four edges, we now rotate the photo so that the label squarely faces the camera.

def imgRotateWarpLabel(self):
    '''
    Rotate the image so that the label's bottom edge is horizontal.
    Apply a four point perspective transform to both the original image
    and grayscale image to obtain a top-down bird's eye view of the label.
    '''
    img_cp = self.imgInputResize.copy()
    img_ro, pts_ro = mathe.rotate_image(img_cp, self.labelOutlineCnt.reshape(4, 2))
    gray_ro = cv2.cvtColor(img_ro, cv2.COLOR_BGR2GRAY)
    img_warped = mathe.four_point_transform(img_ro, pts_ro)
    grey_warped = mathe.four_point_transform(gray_ro, pts_ro)
    img_warped_h, img_warped_w, img_warped_c= img_warped.shape
    img_resize_width = 1080
    img_reshape_factor = img_warped_w/img_resize_width
    img_resize_height = int(img_warped_h/img_reshape_factor)
    self.imgWarpedResize = cv2.resize(img_warped, (img_resize_width, img_resize_height))
    self.greyImgWarpedResize = cv2.resize(grey_warped, (img_resize_width, img_resize_height))
    #print(img_warped.shape, self.greyImgWarpedResize.shape)
    if self.imgDisplay == True:
        # draw the contour of the label on the image
        cv2.namedWindow("Label image after rotation",cv2.WINDOW_NORMAL)
        cv2.setWindowProperty("Label image after rotation", cv2.WND_PROP_FULLSCREEN, cv2.WINDOW_FULLSCREEN)
        cv2.imshow("Label image after rotation", img_ro)
        cv2.waitKey(0)
        cv2.destroyAllWindows()
        cv2.namedWindow("Label image after warping",cv2.WINDOW_NORMAL)
        cv2.setWindowProperty("Label image after warping", cv2.WND_PROP_FULLSCREEN, cv2.WINDOW_FULLSCREEN)
        cv2.imshow("Label image after warping", self.imgWarpedResize)
        cv2.waitKey(0)
        cv2.destroyAllWindows()
    if self.imgSave == True:
        cv2.imwrite('{0}/4_ImageRotate.png'.format(self.imgSaveFullPath),img_ro)
        cv2.imwrite('{0}/5_ImageWarped.png'.format(self.imgSaveFullPath),self.imgWarpedResize)
    print("Image rotation & wrapping done. Output: Image after wrapping and resized to a standard scale {0}.".format(self.imgWarpedResize.shape))

The first piece here is mathe.rotate_image. Given the four corner points found above, it forms the four edges between adjacent corners and computes each edge's length and angle. In this case we pick the shortest edge, take its angle to the horizontal, and rotate the image so that this edge becomes horizontal.

The code:

def rotate_image(image, pts):
    # We will first order the points to the sequence of: top-left, top-right, bottom-right, bottom-left
    rect_pts = order_points(pts)
    (tl, tr, br, bl) = rect_pts
    rect_pts_ = [tl, tr, br, bl, tl]
    rect_len_min = [1000, 0]    # [shortest edge length so far, its angle to the horizontal in degrees]
    # Then we find the shortest line by calculating distance between two adjacent corners. 
    # For the shortest line, we calculate the angle to the horizontal line.
    for i in range(len(rect_pts_)-1):
        rect_len = np.linalg.norm(rect_pts_[i+1]-rect_pts_[i])
        if rect_len < rect_len_min[0]:
            rect_len_min[0] = rect_len
            rect_len_min[1] = np.arctan((rect_pts_[i+1][1]-rect_pts_[i][1])/(rect_pts_[i+1][0]-rect_pts_[i][0]))*180/np.pi


    print("rect_len_min: ", rect_len_min)
    cX = (tl[0]+tr[0]+br[0]+bl[0])/4.0
    cY = (tl[1]+tr[1]+br[1]+bl[1])/4.0
    # Rotate based on the center of label. Rotate angle is rect_len_min[1]
    M = cv2.getRotationMatrix2D((cX,cY), rect_len_min[1], 1.0)  
    #print("M: ",M)
    image_ro = cv2.warpAffine(image, M, (image.shape[1],image.shape[0]))
    # Also rotate the corner points
    pts_ro = rotate_pts(pts, M)

    return image_ro, pts_ro
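
Two helpers from mathe.py are used here but not reproduced in the post: order_points and rotate_pts. For completeness, here are minimal plausible versions consistent with how they are called above (sketches; the downloaded code may differ):

import numpy as np

def order_points(pts):
    # Order 4 points as: top-left, top-right, bottom-right, bottom-left.
    rect = np.zeros((4, 2), dtype="float32")
    s = pts.sum(axis=1)
    rect[0] = pts[np.argmin(s)]      # top-left: smallest x + y
    rect[2] = pts[np.argmax(s)]      # bottom-right: largest x + y
    d = np.diff(pts, axis=1)
    rect[1] = pts[np.argmin(d)]      # top-right: smallest y - x
    rect[3] = pts[np.argmax(d)]      # bottom-left: largest y - x
    return rect

def rotate_pts(pts, M):
    # Apply the same 2x3 affine matrix used by cv2.warpAffine to the corners.
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])   # homogeneous coordinates
    return (M @ pts_h.T).T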

The result:

[Figure: label image after rotation]
As you can see, the image has been rotated only slightly, because the shortest edge here is the topmost one in the picture.

Next comes mathe.four_point_transform, whose purpose is to remove, as far as possible, the perspective distortion introduced by the shooting angle. The code:

def four_point_transform(image, pts):
    # obtain a consistent order of the points and unpack them
    # individually
    rect = order_points(pts)
    (tl, tr, br, bl) = rect
    # compute the width of the new image, which will be the
    # maximum distance between bottom-right and bottom-left
    # x-coordinates or the top-right and top-left x-coordinates
    widthA = np.sqrt(((br[0] - bl[0]) ** 2) + ((br[1] - bl[1]) ** 2))
    widthB = np.sqrt(((tr[0] - tl[0]) ** 2) + ((tr[1] - tl[1]) ** 2))
    maxWidth = max(int(widthA), int(widthB))
    # compute the height of the new image, which will be the
    # maximum distance between the top-right and bottom-right
    # y-coordinates or the top-left and bottom-left y-coordinates
    heightA = np.sqrt(((tr[0] - br[0]) ** 2) + ((tr[1] - br[1]) ** 2))
    heightB = np.sqrt(((tl[0] - bl[0]) ** 2) + ((tl[1] - bl[1]) ** 2))
    maxHeight = max(int(heightA), int(heightB))
    # now that we have the dimensions of the new image, construct
    # the set of destination points to obtain a "birds eye view",
    # (i.e. top-down view) of the image, again specifying points
    # in the top-left, top-right, bottom-right, and bottom-left
    # order
    dst = np.array([
        [0, 0],
        [maxWidth - 1, 0],
        [maxWidth - 1, maxHeight - 1],
        [0, maxHeight - 1]], dtype = "float32")
    # compute the perspective transform matrix and then apply it
    M = cv2.getPerspectiveTransform(rect, dst)
    warped = cv2.warpPerspective(image, M, (maxWidth, maxHeight))
    # return the warped image
    return warped
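
Putting the two functions together, the rotate-then-dewarp stage of imgRotateWarpLabel reduces to the following (a usage sketch; corners stands for the four-point contour found earlier):

img_ro, pts_ro = rotate_image(img, corners.reshape(4, 2))   # make the shortest edge horizontal
warped = four_point_transform(img_ro, pts_ro)               # then remove the perspective distortion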

The result:

[Figure: label image after the four-point transform]
A natural question: why not apply mathe.four_point_transform directly? In this scene it would work, but with a large rotation angle it breaks down: with the usual sum/difference corner-ordering scheme (as in the order_points sketch above), a heavily rotated label can make two corners tie or swap, so the perspective transform receives mislabeled corners. Rotating first brings the label close to horizontal, where the ordering is unambiguous.
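
To make the failure concrete, here is a small numeric demo against the order_points sketch above: rotate a 200x80 rectangle by 60 degrees and the smallest-sum corner coincides with the smallest-difference corner, so "top-left" and "top-right" come out as the same point and the perspective transform degenerates (a sketch; the actual mathe.py may order points differently):

import numpy as np

theta = np.deg2rad(60)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rect = np.array([[0, 0], [200, 0], [200, 80], [0, 80]], dtype="float32")
pts = (rect - rect.mean(axis=0)) @ R.T + 300    # rotate around the center
print(order_points(pts))    # rows 0 and 1 ("top-left" and "top-right") coincide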

Finally, we run OCR on the dewarped image. The result:

[Figure: OCR result on the dewarped label]
Quite a bit better than running detection on the raw photo, isn't it?
