TensorRT Acceleration for YOLO-family Models

Preface

The TensorRT introduction below is taken from Ultralytics. In short, it can dramatically improve model speed; the body of this post records the deployment workflow and a pitfall log.

TensorRT is an SDK developed by NVIDIA for high-speed deep-learning inference. It optimizes deep-learning models for NVIDIA GPUs, enabling faster and more efficient execution. Converting a model to TensorRT format unlocks the full potential of NVIDIA GPUs, which makes it well suited to real-time applications such as object detection.

TensorRT models offer a number of key features:

  • Precision calibration: TensorRT supports precision calibration, allowing models to be fine-tuned for specific accuracy requirements, including reduced-precision formats such as INT8 and FP16. This can further increase inference speed while keeping accuracy at an acceptable level.
  • Layer fusion: the TensorRT optimization process includes layer fusion, which merges multiple neural-network layers into a single operation. This reduces computational overhead and improves inference speed by minimizing memory accesses and computation.
  • Dynamic tensor memory management: TensorRT manages tensor memory efficiently during inference, reducing memory overhead and optimizing allocation, which makes GPU memory usage more efficient.
  • Automatic kernel tuning: TensorRT applies automatic kernel tuning to select the best-optimized GPU kernel for each layer of the model. This adaptive approach ensures the model makes full use of the GPU's compute capability.

Exporting YOLO to TensorRT

Environment

GPU

  • NVIDIA 3090*2
  • GPU driver 535.104.05
  • CUDA 12.2
  • CUDA Toolkit (cuda_12.2.2_535.104.05_linux)
  • cuDNN (v8.9.7) (v9.5.0)

YOLO version

  • v11 (Ultralytics YOLO11)

PyTorch version

  • v2.1.2

Python environment

  • CentOS7.9
  • anaconda3
  • python3.9

Installing Python dependencies

First, update the Ultralytics package:

pip install -U ultralytics -i https://pypi.tuna.tsinghua.edu.cn/simple

The package does not ship the TensorRT export dependencies by default, so the onnx and tensorrt Python packages must be installed separately. Running the official export code will also auto-install them, but downloads are slow from mainland China.

pip install onnx==1.17.0 onnxruntime-gpu==1.16.3 -i https://pypi.tuna.tsinghua.edu.cn/simple

The official requirement is tensorrt >7.0.0 and !=10.1.0, but in this environment installing a recent TensorRT kept failing, presumably due to a broken cuDNN setup. As a compromise, the 9.0.0.post11.dev1 build from the mirror was used.

pip install tensorrt==9.0.0.post11.dev1 -i https://pypi.tuna.tsinghua.edu.cn/simple
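The ">7.0.0, !=10.1.0" constraint can be checked programmatically before installing. A small sketch (the helper name and parsing rule are illustrative, not Ultralytics API; non-numeric suffixes such as "post11" are simply ignored):

```python
def trt_version_ok(version: str) -> bool:
    # Take the numeric triple: "9.0.0.post11.dev1" -> (9, 0, 0),
    # then apply the documented constraint: >7.0.0 and !=10.1.0.
    nums = tuple(int(p) for p in version.split(".")[:3] if p.isdigit())
    return nums > (7, 0, 0) and nums != (10, 1, 0)

print(trt_version_ok("9.0.0.post11.dev1"))  # True
print(trt_version_ok("10.1.0"))             # False (explicitly excluded)
print(trt_version_ok("7.0.0"))              # False (must be strictly greater)
```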

Exporting the .engine model

from ultralytics import YOLO

model = YOLO("yolo11n.pt")
model.export(format="engine")

Pitfall log

This section is the pitfall log, recording the errors hit along the way. To get straight to the export results, skip to the next section.

While testing the conversion of yolo11n.pt to TensorRT, the .onnx export in the sample code worked without issue, but exporting the .engine format failed with:

Unable to load any of {libcudnn_engines_precompiled.so.9.5.0, libcudnn_engines_precompiled.so.9.5, libcudnn_engines_precompiled.so.9, libcudnn_engines_precompiled.so} ...
RuntimeError: CUDNN_BACKEND_TENSOR_DESCRIPTOR cudnnFinalize failed cudnn_status: CUDNN_STATUS_SUBLIBRARY_LOADING_FAILED

Focus first on the core error: CUDNN_BACKEND_TENSOR_DESCRIPTOR cudnnFinalize failed cudnn_status: CUDNN_STATUS_SUBLIBRARY_LOADING_FAILED

A web search turned up almost no workable solutions; all that could be established was that cuDNN was at fault. PyTorch uses the cuDNN bundled inside the conda environment, but day-to-day work here also involves PaddlePaddle, so the server has a system-level cuDNN and a conda-level cuDNN coexisting.

Removing the system cuDNN and manually adding torch's cuDNN to a temporary environment variable did not help; the error persisted. Re-reading the error message, the opening line stands out: Unable to load any of {libcudnn_engines_precompiled.so.9.5.0, libcudnn_engines_precompiled.so.9.5, libcudnn_engines_precompiled.so.9, libcudnn_engines_precompiled.so}

Now it is clear: the current cuDNN environment is missing libcudnn_engines_precompiled.so*. Search the whole disk for it:

find / -name libcudnn_engines_precompiled.so*

The check showed that libcudnn_engines_precompiled.so does exist in the conda environment but not under the system /usr/local/cuda, so the conflict likely comes from the system cuDNN being loaded. Re-examining the original system cuDNN installer confirmed that it indeed ships no libcudnn_engines_precompiled* at all, which suggested TensorRT might need to be installed separately.
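The four filenames listed in the error follow the usual shared-library version fallback: the full version first, then progressively less specific names. The search order can be reproduced with a small sketch (the helper is illustrative, not a cuDNN API):

```python
def soname_candidates(lib: str, version: str) -> list:
    # "9.5.0" -> try lib.so.9.5.0, lib.so.9.5, lib.so.9, then plain lib.so
    parts = version.split(".")
    names = [f"{lib}.so.{'.'.join(parts[:i])}" for i in range(len(parts), 0, -1)]
    names.append(f"{lib}.so")
    return names

print(soname_candidates("libcudnn_engines_precompiled", "9.5.0"))
```

The output matches the brace-enclosed list in the error message exactly, which is how the missing version (9.5) can be read straight off the log.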

So pycuda was installed separately and the TensorRT tarball downloaded, but after unpacking there was still no trace of libcudnn_engines_precompiled.so*.

The last guess was to try updating the server's cuDNN. Another pitfall here: cuDNN had originally been installed from https://developer.nvidia.com/rdp/cudnn-archive, where the newest version is only v8.9.7, so it was baffling to download and see the very same installer as a year earlier. A year on, there had to be newer releases; clearly that page is stale and a newer download location was needed.

Persistence paid off: a new download link turned up. Since the missing library version is .9.5, cuDNN v9.5.0 was the choice: https://developer.nvidia.com/cudnn-9-5-0-download-archive

wget https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.5.0.50_cuda12-archive.tar.xz

After unpacking, update cuDNN to 9.5.0:

xz -d cudnn-linux-x86_64-9.5.0.50_cuda12-archive.tar.xz
tar -xvf cudnn-linux-x86_64-9.5.0.50_cuda12-archive.tar
cp cudnn-linux-x86_64-9.5.0.50_cuda12-archive/include/* /usr/local/cuda-12.2/include/
cp cudnn-linux-x86_64-9.5.0.50_cuda12-archive/lib/libcudnn* /usr/local/cuda-12.2/lib64/
chmod a+r /usr/local/cuda-12.2/include/cudnn*.h
chmod a+r /usr/local/cuda-12.2/lib64/libcudnn*

Check the version:

[root@master ~]# cat /usr/local/cuda-12.2/targets/x86_64-linux/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 9
#define CUDNN_MINOR 5
#define CUDNN_PATCHLEVEL 0
--
#define CUDNN_VERSION (CUDNN_MAJOR * 10000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

/* cannot use constexpr here since this is a C-only file */
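The CUDNN_VERSION macro above packs major/minor/patch into a single integer; mirroring it in Python confirms that 9.5.0 maps to 90500:

```python
# Reproduce the CUDNN_VERSION macro from cudnn_version.h:
# CUDNN_VERSION = CUDNN_MAJOR * 10000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL
major, minor, patch = 9, 5, 0
cudnn_version = major * 10000 + minor * 100 + patch
print(cudnn_version)  # 90500
```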

In principle, if TensorRT cannot find libcudnn_engines_precompiled.so* in the conda cuDNN, it should fall back to reading the system cuDNN. Yet the error persisted. So a temporary environment variable was added to force-load the library:

export LD_PRELOAD=/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_engines_precompiled.so

That did it: the system can now locate libcudnn_engines_precompiled.so. But a new error followed:

/lib64/libm.so.6: version 'GLIBC_2.27' not found

Demoralizing, but clearly just a missing GNU C library version. To verify, list the GLIBC versions available on the system:

strings /lib64/libc.so.6 | grep GLIBC

Indeed, the highest version is only 2.17:

GLIBC_2.2.5
GLIBC_2.2.6
GLIBC_2.3
GLIBC_2.3.2
GLIBC_2.3.3
GLIBC_2.3.4
GLIBC_2.4
GLIBC_2.5
GLIBC_2.6
GLIBC_2.7
GLIBC_2.8
GLIBC_2.9
GLIBC_2.10
GLIBC_2.11
GLIBC_2.12
GLIBC_2.13
GLIBC_2.14
GLIBC_2.15
GLIBC_2.16
GLIBC_2.17
GLIBC_PRIVATE
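A side note when scanning this list: the version strings must be compared numerically, not lexically, since as plain strings "GLIBC_2.9" sorts after "GLIBC_2.17". A small helper, assuming dotted numeric versions:

```python
def glibc_key(v: str) -> tuple:
    # "GLIBC_2.17" -> (2, 17), so comparisons are numeric, not lexical
    return tuple(int(p) for p in v.removeprefix("GLIBC_").split("."))

versions = ["GLIBC_2.9", "GLIBC_2.14", "GLIBC_2.17"]
print(max(versions))                 # 'GLIBC_2.9'  (wrong: string comparison)
print(max(versions, key=glibc_key))  # 'GLIBC_2.17' (correct)

# The required 2.27 really is newer than the installed maximum 2.17:
print(glibc_key("GLIBC_2.17") < glibc_key("GLIBC_2.27"))  # True
```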

Next, update GLIBC:

wget http://ftp.gnu.org/gnu/glibc/glibc-2.27.tar.gz
tar xf glibc-2.27.tar.gz
cd glibc-2.27/ && mkdir build && cd build
../configure --prefix=/usr --disable-profile --enable-add-ons --with-headers=/usr/include --with-binutils=/usr/bin
make -j4 && make install

One extra package is needed during this process:

yum install bison

Generating the Makefile failed, as has become routine:

LD_LIBRARY_PATH shouldn't contain the current directory when building glibc. Please change the envir

Fix: LD_LIBRARY_PATH must not contain the current directory; clear the variable and re-run configure.

[root@master ~]# echo $LD_LIBRARY_PATH
:/usr/local/cuda-12.2
[root@master ~]# export LD_LIBRARY_PATH=
[root@master ~]# echo $LD_LIBRARY_PATH
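The culprit is the leading colon in ":/usr/local/cuda-12.2": in a colon-separated search path, an empty component is interpreted as the current directory, which is exactly what glibc's configure refuses. A quick illustration:

```python
# The leading ":" produces an empty path component, which the dynamic
# loader treats as "." (the current directory).
path = ":/usr/local/cuda-12.2"
components = path.split(":")
print(components)  # ['', '/usr/local/cuda-12.2']
```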

The Makefile was finally generated. The subsequent make, however, failed again:

cc1: all warnings being treated as errors

The root cause: the generated build configuration enables the "-Werror" compiler flag by default, so on CentOS 7.9 every compilation warning is treated as an error. Online advice is to comment out the warning flags in the Makefile, but glibc's Makefile contains no "-Werror", so the change has to be made in other generated files instead.

config.make:

enable-werror = no

config.status:

S["enable_werror"]="no"

And add to the Makefile:

CFLAGS = -Wall -Wpointer-arith -Wno-unused
KBUILD_CFLAGS += -w
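For repeated rebuilds, the two config-file edits above can be scripted. A hypothetical helper (the function name and regexes are mine, matching the flag formats shown above):

```python
import re

def disable_werror(text: str) -> str:
    """Flip the -Werror switches that glibc's configure writes into
    config.make and config.status."""
    text = re.sub(r"enable-werror\s*=\s*yes", "enable-werror = no", text)
    text = re.sub(r'S\["enable_werror"\]="yes"', 'S["enable_werror"]="no"', text)
    return text

sample = 'enable-werror = yes\nS["enable_werror"]="yes"\n'
print(disable_werror(sample))
```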

Continuing with make produced another error:

make[2]: *** [/usr/local/glibc-2.28/build/link-defines.h] Error 1 *** These critical programs are missing or too old: make

Updating make resolved it:

wget --no-check-certificate https://ftp.gnu.org/gnu/make/make-4.3.tar.gz
tar -xzvf make-4.3.tar.gz
cd make-4.3/
./configure --prefix=/usr/local/make
make
make install
cd /usr/bin/
mv make make.bak
ln -sv /usr/local/make/bin/make /usr/bin/make
make -v
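The "too old" complaint is because newer glibc releases (2.28 in particular) require GNU Make 4.0 or later, while CentOS 7 ships 3.82. A quick check of `make -v` output can be scripted, assuming the standard "GNU Make X.Y" banner:

```python
import re

def make_version_ok(banner: str, minimum=(4, 0)) -> bool:
    # Parse "GNU Make X.Y" from the first line of `make -v` output.
    m = re.search(r"GNU Make (\d+)\.(\d+)", banner)
    return m is not None and (int(m.group(1)), int(m.group(2))) >= minimum

print(make_version_ok("GNU Make 3.82"))  # False: the version CentOS 7 ships
print(make_version_ok("GNU Make 4.3"))   # True: the version installed above
```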

With that, make -j 4 finally succeeded. make install, however, ended with this error:

/usr/bin/ld: cannot find -lnss_test2: No such file or directory
collect2: error: ld returned 1 exit status
Execution of gcc -B/usr/bin/ failed!
The script has found some problems with your installation!
Please read the FAQ and the README file and check the following:
- Did you change the gcc specs file (necessary after upgrading from
Linux libc5)?
- Are there any symbolic links of the form libXXX.so to old libraries?
Links like libm.so -> libm.so.5 (where libm.so.5 is an old library) are wrong,
libm.so should point to the newly installed glibc file - and there should be
only one such link (check e.g. /lib and /usr/lib)
You should restart this script from your build directory after you've
fixed all problems!
Btw. the script doesn't work if you're installing GNU libc not as your
primary library!
make[1]: *** [Makefile:111: install] Error 1
make[1]: Leaving directory '/usr/local/glibc-2.28'
make: *** [Makefile:15: install] Error 2

Heart thoroughly sunk. After recovering, the gist of the message is that a symlink such as /lib64/libm.so.x (e.g. /lib64/libm.so.5) may still point at an old library, and any such stale links need to be fixed.

In practice, though, the /lib64/libm.so.6 symlink already pointed at the required version. Checking the GLIBC version:

[root@master build]# ldd --version
ldd (GNU libc) 2.28
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

So the installation had in fact succeeded.

Using the .engine model

import time

from ultralytics import YOLO

# Load the exported TensorRT model
tensorrt_model = YOLO("yolo11n.engine")

t = time.time()
# First prediction: model loading dominates, so this takes much longer
results = tensorrt_model("test.jpg")
print(time.time() - t)

t = time.time()
# Second prediction: normal speed from here on
results2 = tensorrt_model("test.jpg")
print(time.time() - t)

# Loop test to observe the steady-state inference time
for _ in range(999):
    t = time.time()
    tensorrt_model("test.jpg", device="cuda:1", save=False)
    print(time.time() - t)
...
image 1/1 /exp/work/video/yolov11trt/bus.jpg: 640x480 4 persons, 1 bus, 0.9ms
Speed: 2.0ms preprocess, 0.9ms inference, 1.0ms postprocess per image at shape (1, 3, 640, 480)
0.009032964706420898
...

In steady state this works out to roughly 9 ms per image, yet the log itself reports only about 4 ms for preprocess + inference + postprocess combined, so where the other half of the wall-clock time goes is unclear.
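To pin down gaps like this, the ad-hoc timing loop can be replaced with a small harness based on time.perf_counter plus summary statistics, which smooths out per-call jitter. A generic sketch (the benchmark helper and the stand-in workload are mine, not Ultralytics API):

```python
import statistics
import time

def benchmark(fn, warmup=2, iters=100):
    """Time a callable: run warmup calls first (model loading, caching),
    then report mean/median wall-clock seconds per call."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples), statistics.median(samples)

# Stand-in workload; on the real model this would be e.g.
# lambda: tensorrt_model("test.jpg", device="cuda:1", save=False)
mean_s, median_s = benchmark(lambda: sum(range(10000)))
print(f"{mean_s * 1000:.3f} ms mean, {median_s * 1000:.3f} ms median")
```

Comparing the harness's wall-clock median against the model's own reported speeds would show how much of the gap is Python-side overhead around the call.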

Exporting FP16 and INT8 models

FP16

from ultralytics import YOLO

# Load the YOLO11 model
model = YOLO("yolo11n.pt")

# Export the model to TensorRT format
model.export(format="engine", half=True, imgsz=(640, 640))  # creates 'yolo11n.engine'

INT8

Add a COCO dataset (the INT8 export uses it for calibration).

The yaml file:

# Ultralytics YOLO 🚀, AGPL-3.0 license
# COCO128 dataset https://www.kaggle.com/ultralytics/coco128 (first 128 images from COCO train2017) by Ultralytics
# Documentation: https://docs.ultralytics.com/datasets/detect/coco/
# Example usage: yolo train data=coco128.yaml
# parent
# ├── ultralytics
# └── datasets
#     └── coco128  ← downloads here (7 MB)

# Train/val/test sets as 1) dir: path/to/imgs, 2) file: path/to/imgs.txt, or 3) list: [path/to/imgs1, path/to/imgs2, ..]

train: /exp/work/video/yolov11trt/coco128/images/train2017
val: /exp/work/video/yolov11trt/coco128/images/train2017
test:

# Classes
names:
  0: person
  1: bicycle
  2: car
  3: motorcycle
  4: airplane
  5: bus
  6: train
  7: truck
  8: boat
  9: traffic light
  10: fire hydrant
  11: stop sign
  12: parking meter
  13: bench
  14: bird
  15: cat
  16: dog
  17: horse
  18: sheep
  19: cow
  20: elephant
  21: bear
  22: zebra
  23: giraffe
  24: backpack
  25: umbrella
  26: handbag
  27: tie
  28: suitcase
  29: frisbee
  30: skis
  31: snowboard
  32: sports ball
  33: kite
  34: baseball bat
  35: baseball glove
  36: skateboard
  37: surfboard
  38: tennis racket
  39: bottle
  40: wine glass
  41: cup
  42: fork
  43: knife
  44: spoon
  45: bowl
  46: banana
  47: apple
  48: sandwich
  49: orange
  50: broccoli
  51: carrot
  52: hot dog
  53: pizza
  54: donut
  55: cake
  56: chair
  57: couch
  58: potted plant
  59: bed
  60: dining table
  61: toilet
  62: tv
  63: laptop
  64: mouse
  65: remote
  66: keyboard
  67: cell phone
  68: microwave
  69: oven
  70: toaster
  71: sink
  72: refrigerator
  73: book
  74: clock
  75: vase
  76: scissors
  77: teddy bear
  78: hair drier
  79: toothbrush

# Download script/URL (optional)
#download: https://ultralytics.com/assets/coco128.zip

#v5loader: True
#fl_gamma: 0.0
#image_weights: False
from ultralytics import YOLO

model = YOLO("yolo11n_int8.pt")
model.export(
    format="engine",
    int8=True,
    data='/exp/work/video/yolov11trt/coco128/coco128.yaml'
)

# Load the exported TensorRT INT8 model
model = YOLO("yolo11n_int8.engine")

Inference works the same way as before. This post will be updated as more material comes up.

Copyright © 2017 - 2025 青域 All Rights Reserved.