Python: Recognizing Weak Image CAPTCHAs

Java架构师必看

Published 2021-03-22 11:38:17

Included in the column: Java架构师必看

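Image CAPTCHAs are hardened against recognition with interference lines, touching characters, and distorted characters. None of those CAPTCHA types is supported here. The supported weak CAPTCHAs are as follows: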

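Approach:
(1) Binarize the image to denoise it, removing noise dots and interference lines from the image, then split out the individual characters, and finally recognize each character.
(2) Image processing uses Pillow, the maintained fork of PIL. It is a third-party package, not part of the Python standard library. Character segmentation and recognition are handled by Tesseract-OCR, an open-source OCR engine whose development was sponsored by Google, called from Python through the pytesseract wrapper library.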
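The "split out the individual characters" step can be sketched with a vertical projection: count the ink pixels in each column and cut wherever a column is empty. This is only an illustrative sketch, not the author's method. The function name `split_columns` and the tiny hand-made grid are assumptions made for the example, and it only works when the characters do not touch, which is exactly the weak-CAPTCHA case described above.

```python
def split_columns(grid, ink=0):
    """Return (start, end) column ranges that contain ink pixels.

    grid is a list of rows; each row is a list of 0 (black) / 255 (white)
    values, i.e. the result of a binarization step. end is exclusive.
    """
    if not grid:
        return []
    width = len(grid[0])
    # Vertical projection: number of ink pixels in every column.
    projection = [sum(1 for row in grid if row[x] == ink) for x in range(width)]

    ranges = []
    start = None
    for x, count in enumerate(projection):
        if count > 0 and start is None:
            start = x                    # a character begins
        elif count == 0 and start is not None:
            ranges.append((start, x))    # a character ends at an empty column
            start = None
    if start is not None:
        ranges.append((start, width))    # character touches the right edge
    return ranges


# Hypothetical 3x7 "image" with two blobs separated by blank columns.
W, B = 255, 0
demo = [
    [W, B, B, W, W, B, W],
    [W, B, W, W, W, B, W],
    [W, B, B, W, W, B, W],
]
print(split_columns(demo))
```

With a Pillow image, each returned range could then be cut out with `image.crop((start, 0, end, image.height))` before recognition.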

Environment

pip install Pillow
# If the installation fails because the download fails, try going through a proxy
pip --proxy http://<proxy-ip>:<port> install Pillow

Tesseract: an open-source OCR engine. Find the project on GitHub and download it from its GitHub download page.

pip install pytesseract
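pytesseract is only a wrapper, so the Tesseract executable itself must be installed and discoverable. A minimal sketch for checking this is shown below. The helper name `find_tesseract` and the Windows install path are assumptions for illustration; adjust them to the actual installation.

```python
import os
import shutil


def find_tesseract(name="tesseract", extra_paths=()):
    """Return the path of the Tesseract executable, or None if not found."""
    # First look on PATH, the same way a shell would.
    found = shutil.which(name)
    if found:
        return found
    # Then try explicitly listed install locations.
    for candidate in extra_paths:
        if os.path.isfile(candidate):
            return candidate
    return None


path = find_tesseract(
    # Assumed default install location on Windows; hypothetical for your machine.
    extra_paths=[r"C:\Program Files\Tesseract-OCR\tesseract.exe"]
)
print("tesseract executable:", path)
```

If the executable is found but is not on PATH, pytesseract can be pointed at it by assigning the path to `pytesseract.pytesseract.tesseract_cmd`.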

Source code demo

from PIL import Image
import pytesseract
# Load the image
def getImage(fileName = 'yzm10.png'):
    img = Image.open(fileName)
    # Print the image's current mode and format
    print('img before conversion: ', img.mode, img.format)
    return img
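# Denoise the image: binarization removes the background color and increases the text contrast
def convert_Image(img, standard=127.5):
    # Convert to grayscale
    image = img.convert('L')
    # [Binarization] Using the threshold `standard`, set every pixel to 0 (black) or 255 (white), which eases the later segmentation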
    pixels = image.load()
    for x in range(image.width):
        for y in range(image.height):
            if pixels[x, y] > standard:
                pixels[x, y] = 255
            else:
                pixels[x, y] = 0
    return image
    
# Use the pytesseract library to recognize the characters in the image
def change_Image_to_text(img):
    '''
    If Tesseract cannot find its trained-data directory, specify it manually.
    Syntax: tessdata_dir_config = '--tessdata-dir "<replace_with_your_tessdata_dir_path>"'
    '''
    tessdata_dir_config = '--tessdata-dir "C:\\Program Files\\Tesseract-OCR\\tessdata"'
    textCode = pytesseract.image_to_string(img, lang='eng', config=tessdata_dir_config)
    print("textCode----------->", textCode)
    # Remove illegal characters (including the trailing whitespace in the OCR output), keeping only letters and digits
    textCode = ''.join(ch for ch in textCode if ch.isalnum())
    return textCode

def main():
    img = convert_Image(getImage())  # denoise
    # img = getImage()
    print('Recognition result:', change_Image_to_text(img))

if __name__ == '__main__':
    main()
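The demo binarizes with a fixed threshold of 127.5, which may not suit CAPTCHAs with a light foreground or a dark background. One common alternative, not used in the original post, is Otsu's method, which picks the threshold from the gray-level histogram. The sketch below is a plain-Python illustration; the function name `otsu_threshold` and the synthetic histogram are assumptions made for the example.

```python
def otsu_threshold(histogram):
    """Pick the threshold that maximizes between-class variance (Otsu's method).

    histogram is a list of 256 pixel counts, one per gray level.
    Pixels with a value greater than the result are treated as background (white).
    """
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold = 0
    best_variance = -1.0
    weight_low = 0   # number of pixels at or below the candidate threshold
    sum_low = 0      # sum of the gray levels of those pixels

    for level, count in enumerate(histogram):
        weight_low += count
        sum_low += level * count
        if weight_low == 0:
            continue                 # no pixels in the lower class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break                    # every pixel is already in the lower class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


# Synthetic histogram: dark text around level 50, light background around 200.
hist = [0] * 256
hist[50] = 100
hist[200] = 100
threshold = otsu_threshold(hist)
print("chosen threshold:", threshold)
```

For a grayscale Pillow image, `image.histogram()` returns such a 256-entry list, and the result could be passed as the `standard` argument of `convert_Image`, or applied in one call with `image.point(lambda p: 255 if p > threshold else 0)`.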

Reference blog post: Python CAPTCHA recognition example (2): recognizing complex CAPTCHAs

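This article originates from jackaroo2020 and was compiled and edited by javajgs_com. All copyright belongs to jackaroo2020. The content reflects the author's personal views and does not imply that Java架构师必看 agrees with or endorses them. If you repost it, please credit the source.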
