PaddleOCR Table Recognition

2022-07-21 Machine Learning

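The ppstructure module in the root directory of PaddleOCR 2.5 is an OCR toolkit that PaddleOCR provides for analyzing and processing complex document structures.
GitHub documentation page: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.5/ppstructure/README_ch.md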

Installing dependencies

· Install paddleocr, version >= 2.5

pip install "paddleocr>=2.5"

· Install layoutparser, the dependency for layout analysis

pip install -U https://paddleocr.bj.bcebos.com/whl/layoutparser-0.0.0-py3-none-any.whl

· Install paddlenlp, the dependency for DocVQA (optional, only needed for the DocVQA feature)

pip install paddlenlp
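
Since everything that follows depends on paddleocr being at least 2.5, it can be worth checking the installed version before going further. The snippet below is a minimal sketch using only the standard library; the helper names parse_version, meets_minimum and installed_version are made up for this illustration, and the comparison only looks at the leading numeric parts of the version string.

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.5.0.3' into (2, 5, 0, 3); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum="2.5"):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(package="paddleocr"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

Calling installed_version() and passing the result to meets_minimum() gives a quick yes/no answer; a None result means the pip install step has not been run in the current environment.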

Quick start

Open a CMD prompt in the PaddleOCR/ppstructure directory, or create a Python script to launch the tool.
· Table recognition

paddleocr --image_dir=docs/table/table.jpg --type=structure --layout=false

Python script:

import os
import cv2
from paddleocr import PPStructure,save_structure_res

table_engine = PPStructure(layout=False, show_log=True)

save_folder = './output'
img_path = 'PaddleOCR/ppstructure/docs/table/table.jpg'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')
    print(line)
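
Each printed entry carries the recognized table as HTML, so a natural next step is to turn that HTML into rows of cell text. The sketch below uses only the standard library. It assumes each region is a dict whose 'res' value is either the HTML string itself or a dict with an 'html' key (which form you get may depend on the PaddleOCR version); TableRows, region_html and table_rows are names invented for this example.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # None while the parser is outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:  # cell without an enclosing <tr>
                self.rows.append([])
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def region_html(region):
    """Return the table HTML of a region, or None if it has none (assumed layout)."""
    res = region.get("res")
    if isinstance(res, dict):
        return res.get("html")
    if isinstance(res, str) and "<table" in res:
        return res
    return None


def table_rows(result):
    """Yield a list of rows for every region that carries table HTML."""
    for region in result:
        html = region_html(region)
        if html:
            parser = TableRows()
            parser.feed(html)
            yield parser.rows
```

With the script above, list(table_rows(result)) would give one list of rows per recognized table. Merged cells (rowspan/colspan) are not expanded in this sketch.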

· Layout analysis

paddleocr --image_dir=docs/table/1.png --type=structure --table=false --ocr=false

Python script:

import os
import cv2
from paddleocr import PPStructure,save_structure_res

table_engine = PPStructure(table=False, ocr=False, show_log=True)

save_folder = './output'
img_path = 'PaddleOCR/ppstructure/docs/table/1.png'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')
    print(line)
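
The layout result is a list of regions, so it is easy to summarize. As a rough sketch, the helper below counts the regions per type and lists them in approximate reading order. It assumes each region dict has a 'type' string and a 'bbox' of the form [x1, y1, x2, y2]; summarize_layout is a name chosen for this example, and sorting by the top-left corner is only a crude approximation for multi-column pages.

```python
from collections import Counter


def summarize_layout(result):
    """Count regions per type and list them top-to-bottom, left-to-right."""
    counts = Counter(region["type"] for region in result)
    # bbox is assumed to be [x1, y1, x2, y2]; sort by top edge, then left edge
    ordered = sorted(result, key=lambda r: (r["bbox"][1], r["bbox"][0]))
    return counts, [(r["type"], r["bbox"]) for r in ordered]
```

For the script above, counts, ordered = summarize_layout(result) gives a quick overview of what the layout model found on the page.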

· Layout analysis + table recognition

paddleocr --image_dir=docs/table/1.png --type=structure

Python script:

import os
import cv2
from paddleocr import PPStructure, draw_structure_result, save_structure_res
from PIL import Image

table_engine = PPStructure(show_log=True)

save_folder = './output'
img_path = './ppstructure/docs/table/table.jpg'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')
    print(line)


font_path = './doc/fonts/simfang.ttf'
image = Image.open(img_path).convert('RGB')
im_show = draw_structure_result(image, result, font_path=font_path)
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')

Parameter reference

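Field	Description	Default
output	Directory where the Excel files and recognition results are saved	./output/table
table_max_len	Size to which the longer side of the image is resized for table structure model prediction	488
table_model_dir	Path of the table structure inference model	None
table_char_dict_path	Path of the dictionary used by the table structure model	../ppocr/utils/dict/table_structure_dict.txt
layout_path_model	Address of the layout analysis model, either an online address or a local path. For a local path, layout_label_map must also be given; on the command line it can be passed as --layout_label_map='{0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"}'	lp://PubLayNet/ppyolov2_r50vd_dcn_365e_publaynet/config
layout_label_map	Label mapping dictionary of the layout analysis model	None
model_name_or_path	Path of the VQA SER model	None
max_seq_length	Maximum token length supported by the VQA SER model	512
label_map_path	Path of the VQA SER label file	./vqa/labels/labels_ser.txt
layout	Whether to run layout analysis in the forward pass	True
table	Whether to run table recognition in the forward pass	True
ocr	Whether to run OCR on the non-table regions found by layout analysis; automatically set to False when layout is False	True
structure_version	Table structure model version; the available option is PP-STRUCTURE, which supports the table structure model	pp-structure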
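
The fields in this table map onto command-line options of the form --name=value, as in the commands shown earlier. As a small sketch of that mapping, the helper below assembles the argument list from keyword arguments; build_command is a hypothetical name, and the function does not check that an option actually exists, it only formats whatever it is given.

```python
def build_command(image_dir, **options):
    """Build the argument list for the paddleocr structure command line."""
    args = ["paddleocr", f"--image_dir={image_dir}", "--type=structure"]
    for name, value in options.items():
        if isinstance(value, bool):
            # the command-line examples spell booleans as true/false
            value = str(value).lower()
        args.append(f"--{name}={value}")
    return args
```

For example, build_command("docs/table/1.png", table=False, ocr=False) reproduces the layout-analysis command from the quick start, and the resulting list can be handed to subprocess.run.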

Model downloads

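Model type	Model name	Description	Download
Layout analysis model	ppyolov2_r50vd_dcn_365e_publaynet	Layout analysis model trained on the PubLayNet dataset; it separates a page into 5 region classes: text, title, table, figure and list	inference model / trained model
OCR model	ch_PP-OCRv3_det_infer	Ultra-lightweight PP-OCRv3 text detection model for Chinese and English	inference model / trained model
OCR model	en_ppocr_mobile_v2.0_table_rec	Text recognition model for English table scenes, trained on the PubLayNet dataset	inference model / trained model
Table recognition model	en_ppocr_mobile_v2.0_table_structure	Table structure prediction model for English table scenes, trained on the PubLayNet dataset	inference model / trained model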

Prediction example (layout analysis + table recognition)

Command line

paddleocr --image_dir=docs/table/table.jpg --type=structure

Python script

import os
import cv2
from paddleocr import PPStructure, draw_structure_result, save_structure_res
from PIL import Image

table_engine = PPStructure(show_log=True)

save_folder = './output'
img_path = './ppstructure/docs/table/table.jpg'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')
    print(line)


font_path = './doc/fonts/simfang.ttf'
image = Image.open(img_path).convert('RGB')
im_show = draw_structure_result(image, result, font_path=font_path)
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')

Service deployment

Install structure_system

hub install deploy\hubserving\structure_system\

Edit config.json so that the GPU is not used

{
    "modules_info": {
        "structure_system": {
            "init_args": {
                "version": "1.0.0",
                "use_gpu": false
            },
            "predict_args": {
            }
        }
    },
    "port": 8870,
    "use_multiprocess": false,
    "workers": 2
}
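
If the same change has to be made on several machines, the edit can be scripted. The following is a minimal sketch that rewrites the two values changed above (use_gpu and port) and leaves everything else in the file alone; configure_service is a hypothetical helper name, and the key layout is taken from the config.json shown here.

```python
import json


def configure_service(config_path, use_gpu=False, port=8870, module="structure_system"):
    """Set use_gpu and port in a hubserving config.json and write it back."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    config["modules_info"][module]["init_args"]["use_gpu"] = use_gpu
    config["port"] = port
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=4)
    return config
```

Called as configure_service("./deploy/hubserving/structure_system/config.json"), it would produce the CPU-only configuration listed above.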

Start the service

hub serving start -c ./deploy/hubserving/structure_system/config.json

Layout analysis + table recognition

python tools/test_hubserving.py --server_url http://127.0.0.1:8870/predict/structure_system --image_dir ppstructure/docs/table/table.jpg

Copy the recognized HTML markup into a file saved as .html and compare it with the original image to check the result.
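
The service can also be called directly from Python instead of going through tools/test_hubserving.py. The sketch below rests on an assumption: that the endpoint accepts a JSON body of the form {"images": [<base64 string>]} and replies with JSON, which is how the hubserving test script is understood to talk to it. The names build_payload and predict are invented for this example, and the exact fields of the reply should be checked against your PaddleHub version.

```python
import base64
import json
import urllib.request


def build_payload(image_bytes):
    """Encode raw image bytes as the JSON request body (assumed format)."""
    encoded = base64.b64encode(image_bytes).decode("utf8")
    return json.dumps({"images": [encoded]}).encode("utf8")


def predict(image_path, server_url="http://127.0.0.1:8870/predict/structure_system"):
    """POST one image to the running service and return the decoded JSON reply."""
    with open(image_path, "rb") as f:
        body = build_payload(f.read())
    request = urllib.request.Request(
        server_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf8"))
```

With the service started as above, predict("ppstructure/docs/table/table.jpg") would return the parsed reply, which avoids having to scrape the HTML out of printed console output.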

Deploying the service and running recognition in PyCharm

Open deploy/hubserving/structure_system/params.py and change the default model locations

...
from deploy.hubserving.structure_table.params import read_params as table_read_params


def read_params():
    cfg = table_read_params()

    # params for layout parser model
    cfg.layout_path_model = 'lp://PubLayNet/ppyolov2_r50vd_dcn_365e_publaynet/config'
    cfg.layout_label_map = None

    cfg.mode = 'structure'
    cfg.output = './output'
    return cfg

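As shown, structure_system/params.py calls read_params from structure_table/params.py, and structure_table/params.py in turn calls read_params from ocr_system/params.py. These parameters mainly drive OCR recognition, so when different scenarios need different models it is best to rewrite each params.py as a standalone file. The modified file looks like this: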

# structure_system/params.py

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function


class Config(object):
    pass


def read_params():
    cfg = Config()

    #params for text detector
    cfg.det_algorithm = "DB"
    cfg.det_model_dir = "C:/Users/9.9/.paddleocr/whl/det/ch/ch_PP-OCRv3_det_infer"
    cfg.det_limit_side_len = 960
    cfg.det_limit_type = 'max'

    #DB params
    cfg.det_db_thresh = 0.3
    cfg.det_db_box_thresh = 0.5
    cfg.det_db_unclip_ratio = 1.6
    cfg.use_dilation = False
    cfg.det_db_score_mode = "fast"

    #EAST params
    cfg.det_east_score_thresh = 0.8
    cfg.det_east_cover_thresh = 0.1
    cfg.det_east_nms_thresh = 0.2

    #params for text recognizer
    cfg.rec_algorithm = "CRNN"
    cfg.rec_model_dir = r"C:/Users/9.9/.paddleocr/whl/rec/ch/ch_PP-OCRv3_rec_infer/"

    cfg.rec_image_shape = "3, 48, 320"
    cfg.rec_batch_num = 6
    cfg.max_text_length = 25

    cfg.rec_char_dict_path = r"C:/Users/9.9/software/PaddleOCR-release-2.5/ppocr/utils/ppocr_keys_v1.txt"
    cfg.use_space_char = True

    #params for text classifier
    cfg.use_angle_cls = True
    cfg.cls_model_dir = r"C:/Users/9.9/software/PaddleOCR-release-2.5/inference/ch_ppocr_mobile_v2.0_cls_infer/"
    cfg.cls_image_shape = "3, 48, 192"
    cfg.label_list = ['0', '180']
    cfg.cls_batch_num = 30
    cfg.cls_thresh = 0.9

    cfg.use_pdserving = False
    cfg.use_tensorrt = False
    cfg.drop_score = 0.5

    # params for table structure model
    cfg.table_max_len = 488
    cfg.table_model_dir = r'C:\Users\9.9\.paddleocr\whl\table\en_ppocr_mobile_v2.0_table_structure_infer/'
    cfg.table_char_dict_path = 'C:/Users/9.9/software/PaddleOCR-release-2.5/ppocr/utils/dict/table_structure_dict.txt'
    cfg.show_log = False

    # params for layout parser model
    cfg.layout_path_model = 'lp://PubLayNet/ppyolov2_r50vd_dcn_365e_publaynet/config'
    # cfg.layout_path_model = './inference/ppyolov2_r50vd_dcn_365e_publaynet'
    cfg.layout_label_map = None

    cfg.mode = 'structure'
    cfg.output = './output'
    return cfg
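
Because this file hard-codes many absolute paths, a wrong path is the most likely reason for the service failing to start. As a small sanity check, the sketch below lists every attribute whose name ends in _dir or _path and whose value does not exist on disk. missing_paths is a hypothetical helper; it skips values that are not strings and anything that looks like a remote address.

```python
import os


def missing_paths(cfg):
    """Return {attribute: path} for local model/dict paths that do not exist."""
    missing = {}
    for name, value in vars(cfg).items():
        if not name.endswith(("_dir", "_path")):
            continue
        if not isinstance(value, str) or "://" in value:
            continue  # skip unset values and remote addresses
        if not os.path.exists(value):
            missing[name] = value
    return missing
```

Running missing_paths(read_params()) before starting the service should return an empty dict when all model directories and dictionary files are in place.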

Next, create a new Python file that extracts the table content and parses its structure with pandas

import os
import re
import pandas as pd


def exec_ocr(cmd: str):
    # Run the command and decode its raw output as UTF-8, whatever the console code page is
    pipe = os.popen(cmd)
    return pipe.buffer.read().decode(encoding='utf8')


img_dir = r"C:\Users\9.9\software\PaddleOCR-release-2.5\ppstructure\docs\table\table.jpg"
ocr_dir = r"python C:\Users\9.9\software\PaddleOCR-release-2.5\tools\test_hubserving.py"
res = exec_ocr(fr"{ocr_dir} --server_url http://127.0.0.1:8870/predict/structure_system --image_dir {img_dir}")

with open("system_res.txt", "w", encoding="utf-8") as f:
    f.write(res)

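with open("system_res.txt", "r", encoding='utf-8') as ocr_file:
    alldata_res = ocr_file.read()

# Pull the HTML string of the table region out of the printed result
system_res = re.search(r"'html': '(.*)'}, 'type': 'Table'}", alldata_res)
if system_res is None:
    raise RuntimeError("no table HTML found in the service output")
table_html = system_res.group(1)
print(table_html)
with open("system_res.html", "w", encoding='utf-8') as f:
    f.write(table_html)

# read_html accepts a file path; the first table is the recognized one
df = pd.read_html("system_res.html", encoding='utf-8', index_col=0)[0]
# Drop the placeholder columns pandas names "Unnamed: n"
df = df.loc[:, ~df.columns.astype(str).str.contains("^Unnamed")]

print(df)
df.to_csv('system_res.csv')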
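
As an alternative to os.popen, the standard subprocess module can run the same command with the arguments passed as a list, which avoids quoting problems with Windows paths that contain spaces. This is only a sketch of that variant, not part of the original script; run_and_capture is a name chosen for this example.

```python
import subprocess


def run_and_capture(args):
    """Run a command given as a list of arguments and return its stdout as text."""
    completed = subprocess.run(
        args,
        capture_output=True,   # collect stdout and stderr instead of printing them
        encoding="utf8",       # decode the output as UTF-8
        errors="replace",      # do not fail on stray undecodable bytes
        check=True,            # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout
```

The call would then take the form run_and_capture(["python", ocr_script, "--server_url", url, "--image_dir", img_dir]), where ocr_script, url and img_dir stand for the values used in the script above.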

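Training the PP-Structure table model

Installation

Download and install PaddleDetection from Paddle's GitHub page (https://github.com/PaddlePaddle/PaddleDetection), then run pip install -r requirements.txt to install the remaining dependencies

Preparing the data

Download the PubLayNet dataset (about 95 GB), which is available directly from this link: https://dax-cdn.cdn.appdomain.cloud/dax-publaynet/1.0.0/publaynet.tar.gz?_ga=2.104193024.1076900768.1622560733-649911202.1622560733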
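
After unpacking, it helps to confirm what the annotation files contain before training. PubLayNet ships its labels in COCO format, so a quick sketch like the one below can count the annotations per category. The function names are invented for this example, and note that train.json is very large, so loading it fully needs a lot of memory; val.json is a more comfortable file to try first.

```python
import json
from collections import Counter


def load_annotations(path):
    """Load a COCO-style annotation file into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def category_counts(coco):
    """Count annotations per category name in a COCO-style dict."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(names[a["category_id"]] for a in coco["annotations"])
```

The number of entries in coco["categories"] is also the value that the num_classes setting of the detection config has to match.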

Configuration files

Train by editing the configuration in configs/ppyolo/ppyolov2_r50vd_dcn_365e_coco.yml:

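_BASE_: [
  '../datasets/coco_detection.yml',   # paths of the training and validation data
  '../runtime.yml',                   # shared runtime parameters, such as whether to use the GPU and how many epochs between checkpoints
  './_base_/ppyolov2_r50vd_dcn.yml',  # the model and its backbone network
  './_base_/optimizer_365e.yml',      # learning rate and optimizer configuration
  './_base_/ppyolov2_reader.yml',     # data reader configuration, such as batch size and number of loader worker processes, plus preprocessing after reading, such as resize and data augmentation
]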

snapshot_epoch: 8
weights: output/ppyolov2_r50vd_dcn_365e_coco/model_final
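
The idea behind _BASE_ is that the listed files are loaded first and the keys in the current file (here snapshot_epoch and weights) override them. The snippet below is only an illustration of that override-by-merge idea on plain dictionaries; it is not PaddleDetection's actual loader, and merge_config is a name made up for this sketch.

```python
def merge_config(base, override):
    """Recursively merge two config dicts; values in override win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # both sides are mappings: merge them key by key
            merged[key] = merge_config(merged[key], value)
        else:
            # scalars, lists, or new keys: the override replaces the base value
            merged[key] = value
    return merged
```

In this picture, a value such as snapshot_epoch: 8 in the top-level file simply replaces whatever the base files set, while settings the top-level file does not mention are inherited unchanged.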

In datasets/coco_detection.yml, set the paths to the downloaded training and validation data

metric: COCO
num_classes: 5  # PubLayNet has 5 region classes (the COCO default is 80)

TrainDataset:
  !COCODataSet
    image_dir: C:/Users/9.9/software/publaynet/train
    anno_path: C:/Users/9.9/software/publaynet/train.json
    dataset_dir: dataset/coco
    data_fields: ['image', 'gt_bbox', 'gt_class', 'is_crowd']

EvalDataset:
  !COCODataSet
    image_dir: C:/Users/9.9/software/publaynet/val
    anno_path: C:/Users/9.9/software/publaynet/val.json
    dataset_dir: dataset/coco

TestDataset:
  !ImageFolder
    anno_path: C:/Users/9.9/software/publaynet/val.json # also support txt (like VOC's label_list.txt)
    dataset_dir: dataset/coco # if set, anno_path will be 'dataset_dir/anno_path'

Training

python tools/train.py -c configs/ppyolo/ppyolov2_r50vd_dcn_365e_coco.yml