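PaddleOCR Table Recognition

The ppstructure module in the root directory of PaddleOCR 2.5 is an OCR toolkit provided by PaddleOCR for analyzing and processing the structure of complex documents.
GitHub documentation page: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.5/ppstructure/README_ch.md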

Installing dependencies

· Install paddleocr, version >= 2.5

pip install "paddleocr>=2.5"

· Install layoutparser, the dependency for layout analysis

pip install -U https://paddleocr.bj.bcebos.com/whl/layoutparser-0.0.0-py3-none-any.whl

· Install paddlenlp, the dependency for DocVQA (optional; only needed for the DocVQA feature)

pip install paddlenlp

Quick start

Open a command prompt (CMD) in the PaddleOCR/ppstructure directory, or create a Python script, and run one of the following.

· Table recognition

paddleocr --image_dir=docs/table/table.jpg --type=structure --layout=false

Python script:

import os
import cv2
from paddleocr import PPStructure, save_structure_res

# layout=False skips layout analysis, so only table recognition runs
table_engine = PPStructure(layout=False, show_log=True)

save_folder = './output'
img_path = 'PaddleOCR/ppstructure/docs/table/table.jpg'
img = cv2.imread(img_path)
result = table_engine(img)
# Save the Excel file and recognition results under save_folder/<image name>
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')  # drop the cropped image array before printing
    print(line)
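Once the result has been printed, the table markup usually needs to be pulled out of it. The following is a minimal sketch, assuming each entry is a dict with a 'type' label and, for table regions, a 'res' dict that holds an 'html' string. That shape and the helper name table_html_blocks are assumptions made for illustration, so compare them with what print(line) actually shows before relying on them.

```python
def table_html_blocks(result):
    """Return the HTML string of every table region in a PPStructure-style result."""
    blocks = []
    for region in result:
        # Assumed shape: {'type': 'Table', 'bbox': [...], 'res': {'html': '<html>...'}}
        if str(region.get('type', '')).lower() != 'table':
            continue
        res = region.get('res')
        if isinstance(res, dict) and 'html' in res:
            blocks.append(res['html'])
    return blocks


# Hand-written stand-in for the engine output, used only to illustrate the shape
sample_result = [
    {'type': 'Text', 'bbox': [0, 0, 10, 10], 'res': []},
    {'type': 'Table', 'bbox': [0, 20, 100, 80],
     'res': {'html': '<html><body><table><tr><td>a</td></tr></table></body></html>'}},
]
print(table_html_blocks(sample_result))
```

The label comparison is case-insensitive because the exact spelling of the table label may differ between versions.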

· Layout analysis

paddleocr --image_dir=docs/table/1.png --type=structure --table=false --ocr=false

Python script:

import os
import cv2
from paddleocr import PPStructure, save_structure_res

# table=False and ocr=False leave only layout analysis enabled
table_engine = PPStructure(table=False, ocr=False, show_log=True)

save_folder = './output'
img_path = 'PaddleOCR/ppstructure/docs/table/1.png'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')  # drop the cropped image array before printing
    print(line)

· Layout analysis + table recognition

paddleocr --image_dir=docs/table/1.png --type=structure

Python script:

import os
import cv2
from paddleocr import PPStructure, draw_structure_result, save_structure_res
from PIL import Image

# With the defaults, layout analysis, table recognition and OCR all run
table_engine = PPStructure(show_log=True)

save_folder = './output'
img_path = './ppstructure/docs/table/table.jpg'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')  # drop the cropped image array before printing
    print(line)

# Draw the recognized regions on the image and save the visualization
font_path = './doc/fonts/simfang.ttf'
image = Image.open(img_path).convert('RGB')
im_show = draw_structure_result(image, result, font_path=font_path)
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')

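Parameters

Field | Description | Default
output | Directory where the Excel file and recognition results are saved | ./output/table
table_max_len | Size the long side of the image is resized to when the table structure model predicts | 488
table_model_dir | Path to the table structure inference model | None
table_char_dict_path | Path to the dictionary used by the table structure model | ../ppocr/utils/dict/table_structure_dict.txt
layout_path_model | Path to the layout analysis model; either an online address or a local path. A local path also requires layout_label_map, which on the command line can be passed as --layout_label_map='{0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"}' | lp://PubLayNet/ppyolov2_r50vd_dcn_365e_publaynet/config
layout_label_map | Label mapping dictionary for the layout analysis model | None
model_name_or_path | Path to the VQA SER model | None
max_seq_length | Maximum token length supported by the VQA SER model | 512
label_map_path | Path to the VQA SER label file | ./vqa/labels/labels_ser.txt
layout | Whether to run layout analysis in the forward pass | True
table | Whether to run table recognition in the forward pass | True
ocr | Whether to run OCR on the non-table regions found by layout analysis; automatically set to False when layout is False | True
structure_version | Version of the table structure model; the available option is PP-STRUCTURE, which supports the table structure model | pp-structure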
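The fields above map directly onto command-line options of the form --name=value. As a rough sketch, the helper below (build_structure_command is a hypothetical name, not part of paddleocr) assembles such a command from keyword arguments, writing booleans in the lowercase form used by the command-line examples in this post. It does no quoting, so values containing spaces would need extra care.

```python
def build_structure_command(image_dir, **options):
    """Assemble a `paddleocr --type=structure` command line from keyword options."""
    parts = ['paddleocr', f'--image_dir={image_dir}', '--type=structure']
    for name, value in options.items():
        if isinstance(value, bool):
            value = 'true' if value else 'false'  # the CLI examples use lowercase booleans
        parts.append(f'--{name}={value}')
    # Plain join: no shell quoting is applied, so keep values free of spaces
    return ' '.join(parts)


print(build_structure_command('docs/table/table.jpg', layout=False))
print(build_structure_command('docs/table/1.png', table=False, ocr=False))
```

Building the string in one place keeps the Python and command-line variants of a run in sync when parameters such as table_max_len are tuned.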

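Model downloads

Model type | Model name | Description | Download
Layout analysis model | ppyolov2_r50vd_dcn_365e_publaynet | Layout analysis model trained on the PubLayNet dataset; distinguishes 5 region types: text, title, table, figure, and list | inference model / trained model
OCR model | ch_PP-OCRv3_det_infer | Ultra-lightweight Chinese and English PP-OCRv3 text detection model | inference model / trained model
OCR model | en_ppocr_mobile_v2.0_table_rec | Text recognition model for English table scenes, trained on the PubLayNet dataset | inference model / trained model
Table recognition model | en_ppocr_mobile_v2.0_table_structure | Table structure prediction model for English table scenes, trained on the PubLayNet dataset | inference model / trained model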

Prediction example (layout analysis + table recognition)

Command line

paddleocr --image_dir=docs/table/table.jpg --type=structure

Python script

import os
import cv2
from paddleocr import PPStructure, draw_structure_result, save_structure_res
from PIL import Image

# With the defaults, layout analysis, table recognition and OCR all run
table_engine = PPStructure(show_log=True)

save_folder = './output'
img_path = './ppstructure/docs/table/table.jpg'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')  # drop the cropped image array before printing
    print(line)

# Draw the recognized regions on the image and save the visualization
font_path = './doc/fonts/simfang.ttf'
image = Image.open(img_path).convert('RGB')
im_show = draw_structure_result(image, result, font_path=font_path)
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')

Service deployment

Install structure_system

hub install deploy\hubserving\structure_system\

Edit config.json so that the GPU is not used

{
    "modules_info": {
        "structure_system": {
            "init_args": {
                "version": "1.0.0",
                "use_gpu": false
            },
            "predict_args": {
            }
        }
    },
    "port": 8870,
    "use_multiprocess": false,
    "workers": 2
}
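The same change can be scripted instead of edited by hand. The sketch below is illustrative only: set_use_gpu is a hypothetical helper, and it assumes the file keeps the modules_info / init_args layout shown above. To stay runnable without a PaddleOCR checkout, the demonstration works on a throwaway copy in a temporary directory.

```python
import json
import os
import tempfile


def set_use_gpu(config_path, module='structure_system', use_gpu=False):
    """Rewrite config.json so that the given module runs with the chosen use_gpu value."""
    with open(config_path, encoding='utf-8') as f:
        config = json.load(f)
    config['modules_info'][module]['init_args']['use_gpu'] = use_gpu
    with open(config_path, 'w', encoding='utf-8') as f:
        json.dump(config, f, indent=4)
    return config


# Demonstration on a throwaway file that mimics the layout of config.json
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'config.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump({
            'modules_info': {'structure_system': {
                'init_args': {'version': '1.0.0', 'use_gpu': True},
                'predict_args': {}}},
            'port': 8870, 'use_multiprocess': False, 'workers': 2,
        }, f)
    updated = set_use_gpu(path)
    print(updated['modules_info']['structure_system']['init_args'])
```

For the real file, the call would be set_use_gpu('./deploy/hubserving/structure_system/config.json').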

Start the service

hub serving start -c ./deploy/hubserving/structure_system/config.json

Layout analysis + table recognition

python tools/test_hubserving.py --server_url http://127.0.0.1:8870/predict/structure_system --image_dir ppstructure/docs/table/table.jpg
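The client script itself is not reproduced in this post. As a hedged sketch of what such a request might look like, the code below builds a JSON body under the assumption (not confirmed here) that the endpoint accepts an object with an "images" list of base64-encoded image files. Both helper names are hypothetical, and the byte string is a placeholder rather than a real image.

```python
import base64
import json


def build_predict_payload(image_bytes):
    """Encode raw image bytes as the JSON body assumed for the /predict endpoint."""
    encoded = base64.b64encode(image_bytes).decode('utf8')
    return json.dumps({'images': [encoded]})


def decode_first_image(payload):
    """Inverse of build_predict_payload, useful for checking that the encoding round-trips."""
    return base64.b64decode(json.loads(payload)['images'][0])


# Placeholder bytes; in practice these would come from open(path, 'rb').read()
image_bytes = b'\x89PNG placeholder bytes'
payload = build_predict_payload(image_bytes)
print(len(payload), 'characters in the request body')
```

If the assumption holds, the body would be sent with an HTTP POST and a JSON content type to http://127.0.0.1:8870/predict/structure_system; check tools/test_hubserving.py for the exact request format before relying on this.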

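Copying the HTML markup from the recognition output into a file saved as .html makes it possible to compare the recognized table with the original image.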

Deploying the service and running recognition in PyCharm

Open deploy/hubserving/structure_system/params.py and change the default model locations

...
from deploy.hubserving.structure_table.params import read_params as table_read_params


def read_params():
    cfg = table_read_params()

    # params for layout parser model
    cfg.layout_path_model = 'lp://PubLayNet/ppyolov2_r50vd_dcn_365e_publaynet/config'
    cfg.layout_label_map = None

    cfg.mode = 'structure'
    cfg.output = './output'
    return cfg

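As shown, structure_system/params.py imports read_params from structure_table/params.py, and structure_table/params.py in turn imports read_params from ocr_system/params.py. These parameters mainly configure OCR recognition, so when different scenarios need different models it is best to rewrite each params.py as a standalone file. The modified file is as follows: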

# structure_system/params.py

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function


class Config(object):
    pass


def read_params():
    cfg = Config()

    # params for text detector
    cfg.det_algorithm = "DB"
    cfg.det_model_dir = "C:/Users/9.9/.paddleocr/whl/det/ch/ch_PP-OCRv3_det_infer"
    cfg.det_limit_side_len = 960
    cfg.det_limit_type = 'max'

    # DB params
    cfg.det_db_thresh = 0.3
    cfg.det_db_box_thresh = 0.5
    cfg.det_db_unclip_ratio = 1.6
    cfg.use_dilation = False
    cfg.det_db_score_mode = "fast"

    # EAST params
    cfg.det_east_score_thresh = 0.8
    cfg.det_east_cover_thresh = 0.1
    cfg.det_east_nms_thresh = 0.2

    # params for text recognizer
    cfg.rec_algorithm = "CRNN"
    cfg.rec_model_dir = r"C:/Users/9.9/.paddleocr/whl/rec/ch/ch_PP-OCRv3_rec_infer/"

    cfg.rec_image_shape = "3, 48, 320"
    cfg.rec_batch_num = 6
    cfg.max_text_length = 25

    cfg.rec_char_dict_path = r"C:/Users/9.9/software/PaddleOCR-release-2.5/ppocr/utils/ppocr_keys_v1.txt"
    cfg.use_space_char = True

    # params for text classifier
    cfg.use_angle_cls = True
    cfg.cls_model_dir = r"C:/Users/9.9/software/PaddleOCR-release-2.5/inference/ch_ppocr_mobile_v2.0_cls_infer/"
    cfg.cls_image_shape = "3, 48, 192"
    cfg.label_list = ['0', '180']
    cfg.cls_batch_num = 30
    cfg.cls_thresh = 0.9

    cfg.use_pdserving = False
    cfg.use_tensorrt = False
    cfg.drop_score = 0.5

    # params for table structure model
    cfg.table_max_len = 488
    cfg.table_model_dir = r'C:\Users\9.9\.paddleocr\whl\table\en_ppocr_mobile_v2.0_table_structure_infer/'
    cfg.table_char_dict_path = 'C:/Users/9.9/software/PaddleOCR-release-2.5/ppocr/utils/dict/table_structure_dict.txt'
    cfg.show_log = False

    # params for layout parser model
    cfg.layout_path_model = 'lp://PubLayNet/ppyolov2_r50vd_dcn_365e_publaynet/config'
    # cfg.layout_path_model = './inference/ppyolov2_r50vd_dcn_365e_publaynet'
    cfg.layout_label_map = None

    cfg.mode = 'structure'
    cfg.output = './output'
    return cfg
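Because a standalone params.py hard-codes machine-specific paths, a quick check that they exist can save a failed service start. The sketch below is one possible check, not part of PaddleOCR: missing_paths is a hypothetical helper, and the demo config uses made-up paths. It inspects only attributes ending in _model_dir or _dict_path, so the lp:// layout address is skipped.

```python
import os


class Config(object):
    pass


def missing_paths(cfg):
    """Return {attribute: path} for every *_model_dir / *_dict_path that does not exist locally."""
    missing = {}
    for name, value in vars(cfg).items():
        if not name.endswith(('_model_dir', '_dict_path')):
            continue  # thresholds, flags and the lp:// layout address are not local paths
        if isinstance(value, str) and not os.path.exists(value):
            missing[name] = value
    return missing


# Toy config with hypothetical paths, only to show the check
cfg = Config()
cfg.det_model_dir = os.getcwd()                  # exists
cfg.rec_model_dir = './no/such/rec_model_dir'    # hypothetical, missing
cfg.det_limit_side_len = 960                     # ignored: not a path attribute
print(missing_paths(cfg))
```

With the real file, the call would be missing_paths(read_params()), run before starting the service.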

Next, create a new Python file that extracts the table content and loads it into pandas to obtain the structured data

import io
import os
import re
import pandas as pd


def exec_ocr(cmd: str) -> str:
    """Run a shell command and return its stdout decoded as UTF-8."""
    with os.popen(cmd) as pipe:
        return pipe.buffer.read().decode(encoding='utf8')


img_dir = r"C:\Users\9.9\software\PaddleOCR-release-2.5\ppstructure\docs\table\table.jpg"
ocr_dir = r"python C:\Users\9.9\software\PaddleOCR-release-2.5\tools\test_hubserving.py"
res = exec_ocr(fr"{ocr_dir} --server_url http://127.0.0.1:8870/predict/structure_system --image_dir {img_dir}")

# Keep the raw client output for inspection
with open("system_res.txt", "w", encoding="utf-8") as f:
    f.write(res)

with open("system_res.txt", "r", encoding='utf-8') as f:
    alldata_res = f.read()

# Pull the HTML of the table region out of the printed result
system_res = re.search(r"'html': '(.*)'}, 'type': 'Table'}", alldata_res)
if system_res is None:
    raise SystemExit("No table HTML found in the service output")
table_html = system_res.group(1)
print(table_html)
with open("system_res.html", "w", encoding='utf-8') as f:
    f.write(table_html)

# Parse the HTML table into a DataFrame and drop unnamed columns
df = pd.read_html(io.StringIO(table_html), index_col=0)[0]
df = df.loc[:, ~df.columns.str.contains("^Unnamed")]

print(df)
df.to_csv('system_res.csv')
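Where pandas or its HTML parser backend is not installed, the same HTML-to-CSV step can be approximated with the standard library. This is a minimal sketch, not the method used above: TableRows and html_table_to_csv are illustrative names, the sample markup is made up, and merged cells (colspan and rowspan) are not expanded.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_csv(html):
    """Convert the cells of an HTML table into CSV text, one line per row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator='\n').writerows(parser.rows)
    return out.getvalue()


# Made-up markup in the same style as the recognized table HTML
sample = ('<html><body><table><tr><td>name</td><td>qty</td></tr>'
          '<tr><td>apple</td><td>3</td></tr></table></body></html>')
print(html_table_to_csv(sample))
```

Unlike the pandas version, this keeps every cell as plain text and does not treat the first row or column as an index.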

Training the PP-Structure table model

Installation

Download and install PaddleDetection from Paddle's GitHub page: https://github.com/PaddlePaddle/PaddleDetection, then run pip install -r requirements.txt to install the remaining dependencies.

Preparing the data

Download the PubLayNet dataset; it can be fetched directly (about 95 GB) from https://dax-cdn.cdn.appdomain.cloud/dax-publaynet/1.0.0/publaynet.tar.gz?_ga=2.104193024.1076900768.1622560733-649911202.1622560733

Configuration files

Edit configs/ppyolo/ppyolov2_r50vd_dcn_365e_coco.yml to configure training:

_BASE_: [
  '../datasets/coco_detection.yml',  # paths of the training and validation data
  '../runtime.yml',  # shared runtime parameters, e.g. whether to use the GPU and how many epochs between checkpoints
  './_base_/ppyolov2_r50vd_dcn.yml',  # the model and backbone network
  './_base_/optimizer_365e.yml',  # learning rate and optimizer configuration
  './_base_/ppyolov2_reader.yml',  # data reader configuration, e.g. batch size and number of loader worker processes, plus post-read preprocessing such as resize and data augmentation
]

snapshot_epoch: 8
weights: output/ppyolov2_r50vd_dcn_365e_coco/model_final

In datasets/coco_detection.yml, point the dataset entries at the downloaded data

metric: COCO
num_classes: 5  # PubLayNet has 5 region classes

TrainDataset:
  !COCODataSet
    image_dir: C:/Users/9.9/software/publaynet/train
    anno_path: C:/Users/9.9/software/publaynet/train.json
    dataset_dir: dataset/coco
    data_fields: ['image', 'gt_bbox', 'gt_class', 'is_crowd']

EvalDataset:
  !COCODataSet
    image_dir: C:/Users/9.9/software/publaynet/val
    anno_path: C:/Users/9.9/software/publaynet/val.json
    dataset_dir: dataset/coco

TestDataset:
  !ImageFolder
    anno_path: C:/Users/9.9/software/publaynet/val.json # also support txt (like VOC's label_list.txt)
    dataset_dir: dataset/coco # if set, anno_path will be 'dataset_dir/anno_path'
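The value of num_classes should agree with the categories declared in the annotation files. The sketch below shows one way to read them, assuming the files follow the COCO layout with a top-level "categories" list. coco_category_names is a hypothetical helper, and the demonstration uses a small stand-in file because the real annotation files are large.

```python
import json
import os
import tempfile


def coco_category_names(anno_path):
    """Return the category names declared in a COCO-format annotation file, ordered by id."""
    with open(anno_path, encoding='utf-8') as f:
        categories = json.load(f)['categories']
    return [c['name'] for c in sorted(categories, key=lambda c: c['id'])]


# Stand-in annotation file listing the five PubLayNet region types
with tempfile.TemporaryDirectory() as tmp:
    anno_path = os.path.join(tmp, 'val.json')
    with open(anno_path, 'w', encoding='utf-8') as f:
        json.dump({'images': [], 'annotations': [], 'categories': [
            {'id': 1, 'name': 'text'}, {'id': 2, 'name': 'title'},
            {'id': 3, 'name': 'list'}, {'id': 4, 'name': 'table'},
            {'id': 5, 'name': 'figure'}]}, f)
    names = coco_category_names(anno_path)

print(len(names), names)
```

Running it on val.json rather than train.json keeps memory use down, since json.load reads the whole file at once.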

Training

python tools/train.py -c configs/ppyolo/ppyolov2_r50vd_dcn_365e_coco.yml


Copyright © 2017 - 2024 青域 All Rights Reserved.
