Python Web Scraping: Recognizing Web Page CAPTCHAs (tesseract-OCR)

2022-04-29 Python

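In real-world scraping, websites deploy many anti-scraping measures for security and other reasons. The most common is an image CAPTCHA, which checks whether the visitor is a bot.

However, Python has many powerful image-recognition libraries, so the simplest and most primitive kind of image CAPTCHA has been gradually falling out of use. After a long search, I found that the CNKI (China National Knowledge Infrastructure) registration page is, surprisingly, still using one ~ ^ ^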

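Fetching the page's CAPTCHA

The first step is getting the target URL of the CAPTCHA image. I used XPath to locate the CAPTCHA's src link in the HTML, but after reading the content I found that I could not obtain the id used to generate the CAPTCHA. I assumed CNKI was filtering on request headers, so I added "Referer", "Accept-Encoding" and other fields to headers, but that had no effect. Inspecting the traffic under the Doc tab of the browser's network panel showed that the page is requested with the HTTP GET method, and requesting the image without the id also returned a CAPTCHA image.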

import requests
from lxml import etree

# Request headers copied from a browser session. The HTTP/2 pseudo-headers
# ("method", "scheme") are dropped because they are not real request headers.
# The Cookie belongs to the author's session; replace it with your own.
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    # "br" is left out: requests only decodes Brotli when the optional brotli package is installed
    "Accept-Encoding": "gzip, deflate",
    "Cache-Control": "max-age=0",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cookie": "U:M_distinctid=1809edfcf6e8ff-088ab0a6edb385-17333273-1fa400-1809edfcf6f9fe; _pk_ref=[\"\",\"\",1651932918,'https://www.baidu.com/link?url=PnlS4wubLFxXVNXrTBRCZMAhA0P5TZLlCBAhBvXfLke&wd=&eqid=e52482a8001102070000000562767ee2']; Ecp_ClientId=1220507221503170167; Ecp_IpLoginFail=22050736.161.51.128; Ecp_ClientIp=36.161.51.128; language=chs; _pk_id=2ce0e7fb-229f-4440-a78b-57771deac2fe.1651932918.1.1651933578.1651932918.; ASP.NET_SessionId=he1no3kmgo3bwuktv1xa1lhs; SID_mycnki=020101",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
    "Referer": "https://www.cnki.net/",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
}
url = "https://my.cnki.net/Register/CommonRegister.aspx?returnUrl=https://www.cnki.net#"

# Fetch the registration page and parse the HTML
r = requests.get(url, headers=headers).content.decode('utf-8')
html = etree.HTML(r)

# The src attribute of the CAPTCHA <img> is relative to /Register/
link = html.xpath('//*[@id="commonRe"]/div[10]/div[3]/a/img/@src')[0]
img_url = "https://my.cnki.net/Register/" + link
img = requests.get(img_url, headers=headers)

# Write the decoded image bytes to disk
with open('checkcode.jpg', 'wb') as file:
    file.write(img.content)
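Joining "https://my.cnki.net/Register/" and the src value by hand only works while src stays relative to that folder. As a minimal sketch, urllib.parse.urljoin from the standard library resolves a src against the page URL the way a browser would. The src values below are made-up placeholders, not what CNKI actually returns, and resolve_src is a name chosen for illustration.

```python
from urllib.parse import urljoin

# Page URL from the article (query string omitted; it does not affect the join).
page_url = "https://my.cnki.net/Register/CommonRegister.aspx"


def resolve_src(base_url, src):
    """Resolve an <img> src attribute against the URL of the page it came from."""
    return urljoin(base_url, src)


# Hypothetical src values, for illustration only.
relative_src = "CheckCode.aspx"          # relative to the page's folder
rooted_src = "/Register/CheckCode.aspx"  # relative to the site root

print(resolve_src(page_url, relative_src))
print(resolve_src(page_url, rooted_src))
```

Both calls resolve to the same absolute URL, so the scraper keeps working if the site switches between the two forms.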

The Tesserocr library

Tesserocr is an OCR library for Python. It works by wrapping the API of the tesseract library, so the core is still tesseract, and tesseract must be installed beforehand.

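Installing Tesseract

Go to the download page https://digi.bib.uni-mannheim.de/tesseract/, find a version that suits your system, and download the installer.

During installation you can tick the additional language data to add support for Chinese and other languages.

After installation, open the system settings and add the environment variables.

Add a new variable named TESSDATA_PREFIX (it tells Tesseract where its tessdata language files are).

Open CMD and run tesseract -v. If version information is printed, tesseract is installed.

If the console still reports that "tesseract is not recognized as an internal or external command", check the Path system variable: it needs to contain the folder that holds tesseract.exe, and the default entries
"%SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem"
should be added back if they are missing.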
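The same check can be made from Python. The sketch below uses only the standard library to report whether a tesseract executable is found on Path and whether TESSDATA_PREFIX is set; check_tesseract_setup is a helper name made up for this example. Environment changes only reach consoles opened after the change, so run it from a new one.

```python
import os
import shutil


def check_tesseract_setup():
    """Report where the tesseract executable and TESSDATA_PREFIX point, if anywhere."""
    return {
        # Full path of the executable found on Path, or None if it is not found
        "executable": shutil.which("tesseract"),
        # Value of the TESSDATA_PREFIX environment variable, or None if it is unset
        "tessdata_prefix": os.environ.get("TESSDATA_PREFIX"),
    }


status = check_tesseract_setup()
for name, value in status.items():
    print(f"{name}: {value if value else 'NOT SET'}")
```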

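Installing Tesserocr

With Tesseract installed, go on to install the Python OCR libraries. Open CMD or PyCharm and run

pip install Pillow
pip install pytesseract

Then locate the pytesseract.py file and set the tesseract_cmd variable in it to the path of tesseract.exe (note: the position of this line in the file differs between versions).

Finally, install tesserocr

pip install tesserocr

If this fails, go to the tesserocr Windows build releases page on GitHub, https://github.com/simonflueckiger/tesserocr-windows_build/releases, and download the .whl file that matches your Python version to install it manually.

Change to the folder that holds the download and run pip install tesserocr-2.5.2-cp37-cp37m-win_amd64.whl in the console.

When pip reports success, tesserocr is installed.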

Image processing

Greyscale

Opening the downloaded checkcode.jpg with Image.open returns a PIL.Image.Image object, and its convert method turns the RGB colour image into a greyscale image.

from PIL import Image

# Open the downloaded CAPTCHA and display it
img = Image.open("checkcode.jpg")
img.show()
# "L" is Pillow's 8-bit greyscale mode
img_grey = img.convert("L")
img_grey.show()
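For reference, the Pillow documentation describes the RGB-to-"L" conversion as the ITU-R 601-2 luma transform, L = R * 299/1000 + G * 587/1000 + B * 114/1000. The pure-Python sketch below applies that formula to single pixels. The function name is made up for illustration, and Pillow's internal rounding may differ by one grey level.

```python
def rgb_to_luma(r, g, b):
    """Weighted sum of the colour channels (ITU-R 601-2 luma), in the range 0..255."""
    return (r * 299 + g * 587 + b * 114) // 1000


print(rgb_to_luma(255, 255, 255))  # white
print(rgb_to_luma(0, 0, 0))        # black
print(rgb_to_luma(255, 0, 0))      # pure red
print(rgb_to_luma(0, 255, 0))      # pure green
```

Green carries the largest weight, so characters drawn in green come out lighter in the greyscale image than characters drawn in red or blue.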

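This gives the original image and the greyscale image. Running tesserocr on both does not yet produce an accurate result, because the images still contain a lot of interference and noise.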

from PIL import Image
import tesserocr

img = Image.open("checkcode.jpg")
img.show()
img_grey = img.convert("L")
img_grey.show()
# Keep the greyscale image for the next step
img_grey.save('checkcode_grey.jpg')
# Run OCR on both the original and the greyscale image
checkcode = tesserocr.image_to_text(img)
checkcode_grey = tesserocr.image_to_text(img_grey)
print(f"checkcode:{checkcode}")
print(f"checkcode_grey:{checkcode_grey}")

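Binarizing the greyscale image

In the greyscale image, the digits and letters to be recognized (the main subject) are clearly sharper and darker than the interference. The next step is therefore to binarize the greyscale image: pick a threshold and turn every pixel black or white depending on whether its grey level is below the threshold or not. This makes the main subject stand out.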

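from PIL import Image

def bin_table(threshold=125):
    '''Build the greyscale-to-binary lookup table: 0 is black, 1 is white'''
    table = []
    for i in range(256):
        if i < threshold:
            table.append(0)
        else:
            table.append(1)
    return table

img = Image.open('checkcode_grey.jpg')
table = bin_table()
# Apply the lookup table and convert to Pillow's 1-bit mode '1'
bin_img = img.point(table, '1')
bin_img.save('checkcode_bin.jpg')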
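The threshold of 125 is a tuning choice. To show how it decides what survives, the pure-Python sketch below applies the same rule (below the threshold is black, otherwise white) to a small made-up patch of grey levels at two different thresholds. The patch values and the binarize helper are invented for illustration.

```python
def binarize(grey_rows, threshold=125):
    """Map each grey level to 0 (black) if it is below the threshold, else 1 (white)."""
    return [[0 if level < threshold else 1 for level in row] for row in grey_rows]


# Made-up 3x5 patch: dark strokes (30), light-grey interference (170), white background (255).
patch = [
    [255, 30, 30, 170, 255],
    [255, 30, 255, 255, 170],
    [170, 30, 30, 255, 255],
]

for threshold in (125, 200):
    print(f"threshold={threshold}")
    for row in binarize(patch, threshold):
        # '#' marks a black pixel, '.' a white one
        print("".join("#" if pixel == 0 else "." for pixel in row))
```

At 125 the light-grey interference turns white and only the strokes remain; at 200 the interference is kept as black pixels alongside the strokes.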

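Running recognition on the binarized image now gives "g4YS", which is already very close. So where does the problem lie?

Zooming in on the image:

The image still contains a lot of noise.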

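Image denoising

Python offers many denoising methods, such as salt-and-pepper noise removal, Gaussian denoising and mean filtering, and you need to find one that suits the image. For this image I use 8-neighbourhood denoising, with the following code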

from PIL import Image

def sum_9_region_new(img, x, y):
    '''Count the black pixels in the 3x3 region around (x, y) to identify noise'''
    cur_pixel = img.getpixel((x, y))  # value of the current pixel
    width = img.width
    height = img.height

    if cur_pixel == 1:  # the current pixel is white, so skip the neighbourhood count
        return 0

    # This image has black dots all around its border, so border pixels can be removed
    if y < 3:  # in this example, black dots in the top three rows can be removed
        return 1
    elif y > height - 3:  # bottom two rows
        return 1
    else:  # y is not on the border
        if x < 3:  # first three columns
            return 1
        elif x == width - 1:  # last column
            return 1
        else:  # the full 3x3 region is available
            total = img.getpixel((x - 1, y - 1)) \
                  + img.getpixel((x - 1, y)) \
                  + img.getpixel((x - 1, y + 1)) \
                  + img.getpixel((x, y - 1)) \
                  + cur_pixel \
                  + img.getpixel((x, y + 1)) \
                  + img.getpixel((x + 1, y - 1)) \
                  + img.getpixel((x + 1, y)) \
                  + img.getpixel((x + 1, y + 1))
            # With 0/1 pixel values this is the number of black pixels, including the current one
            return 9 - total

def collect_noise_point(img):
    '''Collect the coordinates of all noise pixels'''
    noise_point_list = []
    for x in range(img.width):
        for y in range(img.height):
            res_9 = sum_9_region_new(img, x, y)
            if (0 < res_9 < 3) and img.getpixel((x, y)) == 0:  # isolated point: at most one black neighbour
                pos = (x, y)
                noise_point_list.append(pos)
    return noise_point_list

def remove_noise_pixel(img, noise_point_list):
    '''Turn the listed noise pixels of the binary image white'''
    for item in noise_point_list:
        img.putpixel((item[0], item[1]), 1)

bin_img = Image.open('checkcode_bin.jpg')
noise_point_list = collect_noise_point(bin_img)
remove_noise_pixel(bin_img, noise_point_list)
bin_img.save('checkcode_fin.jpg')
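To make the 8-neighbourhood rule easier to see in isolation, here is a pure-Python sketch that works on a small grid of 0 (black) and 1 (white) values instead of a Pillow image. It is a simplified variant, not the code above: it does not blank the border rows and columns, it handles the edges by skipping neighbours that fall outside the grid, and the function names and test grid are made up for illustration.

```python
def count_black_neighbours(grid, x, y):
    """Count the black (0) pixels among the up-to-8 neighbours of (x, y)."""
    height, width = len(grid), len(grid[0])
    count = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue  # skip the pixel itself
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height and grid[ny][nx] == 0:
                count += 1
    return count


def denoise(grid, max_neighbours=1):
    """Return a copy in which black pixels with too few black neighbours are white."""
    # Collect first, erase afterwards, so one removal cannot change another pixel's count.
    noise = [
        (x, y)
        for y, row in enumerate(grid)
        for x, pixel in enumerate(row)
        if pixel == 0 and count_black_neighbours(grid, x, y) <= max_neighbours
    ]
    cleaned = [row[:] for row in grid]
    for x, y in noise:
        cleaned[y][x] = 1
    return cleaned


# Made-up grid: a 2x3 block of stroke pixels plus one isolated dot at (4, 2).
grid = [
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
]

for row in denoise(grid):
    print("".join("#" if pixel == 0 else "." for pixel in row))
```

The default max_neighbours=1 mirrors the 0 < res_9 < 3 test above (the pixel itself plus at most one black neighbour). A side effect of this rule is that the end pixels of one-pixel-wide strokes also count as noise, so thin characters lose a little at their tips.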

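After running this denoising script, however, the output still contained noise. I then realized that the cause was the .jpg format used for the intermediate files: JPEG compression is lossy and alters the pixel values, and a JPEG is read back as an 8-bit greyscale image rather than a strict black-and-white one. The overall approach is sound, though. Combining the steps into one script and saving the result as .png gives the complete code below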

from PIL import Image

def sum_9_region_new(img, x, y):
    '''Count the black pixels in the 3x3 region around (x, y) to identify noise'''
    cur_pixel = img.getpixel((x, y))  # value of the current pixel
    width = img.width
    height = img.height

    if cur_pixel == 1:  # the current pixel is white, so skip the neighbourhood count
        return 0

    # This image has black dots all around its border, so border pixels can be removed
    if y < 3:  # in this example, black dots in the top three rows can be removed
        return 1
    elif y > height - 3:  # bottom two rows
        return 1
    else:  # y is not on the border
        if x < 3:  # first three columns
            return 1
        elif x == width - 1:  # last column
            return 1
        else:  # the full 3x3 region is available
            total = img.getpixel((x - 1, y - 1)) \
                  + img.getpixel((x - 1, y)) \
                  + img.getpixel((x - 1, y + 1)) \
                  + img.getpixel((x, y - 1)) \
                  + cur_pixel \
                  + img.getpixel((x, y + 1)) \
                  + img.getpixel((x + 1, y - 1)) \
                  + img.getpixel((x + 1, y)) \
                  + img.getpixel((x + 1, y + 1))
            # With 0/1 pixel values this is the number of black pixels, including the current one
            return 9 - total

def collect_noise_point(img):
    '''Collect the coordinates of all noise pixels'''
    noise_point_list = []
    for x in range(img.width):
        for y in range(img.height):
            res_9 = sum_9_region_new(img, x, y)
            if (0 < res_9 < 3) and img.getpixel((x, y)) == 0:  # isolated point: at most one black neighbour
                pos = (x, y)
                noise_point_list.append(pos)
    return noise_point_list

def remove_noise_pixel(img, noise_point_list):
    '''Turn the listed noise pixels of the binary image white'''
    for item in noise_point_list:
        img.putpixel((item[0], item[1]), 1)

def bin_table(threshold=125):
    '''Build the greyscale-to-binary lookup table: 0 is black, 1 is white'''
    table = []
    for i in range(256):
        if i < threshold:
            table.append(0)
        else:
            table.append(1)
    return table

def main():
    # Greyscale, binarize and denoise in memory, then save once as lossless PNG
    img = Image.open('checkcode.jpg')
    img_grey = img.convert('L')
    table = bin_table()
    bin_img = img_grey.point(table, '1')
    noise_point_list = collect_noise_point(bin_img)
    remove_noise_pixel(bin_img, noise_point_list)
    bin_img.save('checkcode_fin.png')

if __name__ == '__main__':
    main()

After zooming in, the interference is gone:
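The difference between the two file formats can be checked with a small round-trip experiment. The sketch below, assuming Pillow is installed, builds a made-up 16x16 greyscale image with a few isolated black pixels, saves it to memory as PNG and as JPEG, and compares what comes back. It uses an "L" image rather than the article's 1-bit image to keep the comparison simple, and round_trip is a helper name invented for this example.

```python
import io

from PIL import Image


def round_trip(img, fmt):
    """Save img to an in-memory file in the given format and load it back."""
    buffer = io.BytesIO()
    img.save(buffer, format=fmt)
    buffer.seek(0)
    reloaded = Image.open(buffer)
    reloaded.load()  # read the pixel data while the buffer is still available
    return reloaded


# Made-up test image: white canvas with three isolated black pixels.
original = Image.new("L", (16, 16), 255)
for xy in [(3, 3), (8, 5), (12, 11)]:
    original.putpixel(xy, 0)

for fmt in ("PNG", "JPEG"):
    reloaded = round_trip(original, fmt)
    same = reloaded.tobytes() == original.tobytes()
    levels = len(set(reloaded.tobytes()))
    print(f"{fmt}: mode={reloaded.mode}, identical={same}, distinct grey levels={levels}")
```

PNG is lossless, so its round trip returns the same bytes. JPEG is lossy, so its round trip would be expected to come back with extra grey levels around each dot, which is the kind of residue that defeats exact 0/1 pixel comparisons.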

Result

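import pytesseract
from PIL import Image

# If tesseract.exe is not on Path, set pytesseract.pytesseract.tesseract_cmd to its full path first
# Open the denoised image written by the complete script above
im = Image.open('checkcode_fin.png')
string = pytesseract.image_to_string(im)

print(string)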